Load Bearing

why read code?

Because sometimes AI write bad code. It is unbecoming, but bad code is bad! I want to document this because, whenever people says "AI still can't write great code without handholding", it is usually declared vehement without example. That just ain't convincing. People deserve examples.

Practical Exhibition 1:

# by claude
 79 +pub fn parquet_event_time_bounds(parquet_path: &Path) -> Result<Option<(i64, i64)>, TarballError> {
       80 +    use parquet::file::reader::{FileReader, SerializedFileReader};
       81 +    use parquet::file::statistics::Statistics;
       82 +    let file = File::open(parquet_path)?;
       83 +    let reader = SerializedFileReader::new(file)?;
       84 +    let meta = reader.metadata();
       85 +
       86 +    let schema = meta.file_metadata().schema_descr();
       87 +    let Some(col_idx) =
       88 +        (0..schema.num_columns()).find(|&i| schema.column(i).name() == "_event_time")
       89 +    else {
       90 +        return Ok(None);
       91 +    };
       92 +
       93 +    let mut bounds: Option<(i64, i64)> = None;
       94 +    for rg in meta.row_groups() {
       95 +        // `_event_time` is normalized to int64 micros at ingest; any other
       96 +        // physical type means the column isn't what we expect, so skip it.
       97 +        if let Some(Statistics::Int64(s)) = rg.column(col_idx).statistics() {
       98 +            if let (Some(&mn), Some(&mx)) = (s.min_opt(), s.max_opt()) {
       99 +                bounds = Some(match bounds {
      100 +                    None => (mn, mx),
      101 +                    Some((lo, hi)) => (lo.min(mn), hi.max(mx)),
      102 +                });
      103 +            }
      104 +        }
      105 +    }
      106 +    Ok(bounds)
      107 +}

This seems perfectly fine. Get the bounds from the column meta for each row group. Replace the existing bounds if the new low is lower, or/and the new high is higher.

The thing is, poor import location not withstanding, event time in parquet can be many things. I64 is the underlying type, true. But it doesn't tell us what unit it is in. Is it micros from epoch? Millis? Hotdogs? In addition, parquet cols for time also include logical type informations about whether the time information is UTC adjusted or local to some timezone!

So, fine. Claude is steadfast eternal. This is what it has now:

       96 +fn parse_event_instant(
       97 +    raw: i64,
       98 +    logical: Option<parquet::basic::LogicalType>,
       99 +) -> Option<DateTime<Utc>> {
      100 +    use parquet::basic::{LogicalType, TimeUnit};
      101 +    match logical {
      102 +        Some(LogicalType::Timestamp {
      103 +            is_adjusted_to_utc,
      104 +            unit,
      105 +        }) => {
      106 +            if !is_adjusted_to_utc {
      107 +                // Contract: `_event_time` is a UTC-adjusted timestamp
      108 +                // (DuckDB TIMESTAMPTZ). A non-adjusted column is a wall clock
      109 +                // with no zone — refuse to invent one rather than emit a bound
      110 +                // off by an unknown offset.
      111 +                return None;
      112 +            }
      113 +            match unit {
      114 +                TimeUnit::MILLIS(_) => DateTime::from_timestamp_millis(raw),
      115 +                TimeUnit::MICROS(_) => DateTime::from_timestamp_micros(raw),
      116 +                TimeUnit::NANOS(_) => Some(DateTime::from_timestamp_nanos(raw)),
      117 +            }
      118 +        }
      119 +        // No declared logical type: fall back to the normalize-at-ingest
      120 +        // convention (int64 micros UTC), matching the /window reader. This is
      121 +        // the ONLY micros assumption, and only when the parquet declares nothing.
      122 +        None => DateTime::from_timestamp_micros(raw),
      123 +        // Any other logical type isn't an instant we can interpret.
      124 +        Some(_) => None,
      125 +    }
      126 +}
# IMAGINE THAT THE PREVIOUS CODE WAS FIXED AND THE LOGICAL TYPES EXTRACTED FOR THE COL

Still, this aint quite right. Especially the parameter part that says logical: Option<parquet::basic::LogicalType>. Why would you hand it a Option LogicalType? Why not just the TimeUnit variant of the LogicalType? If we were still unsure that we could even get our hands on the logical type, we should have bailed out a lot earlier!

Anyhow, there is that.