Type System Utilities
Overview
The core/types/ module implements Floecat's canonical logical type system. It bridges the logical
types declared in protobuf (types/types.proto) with Java helpers used by connectors, schema
mappers, statistics engines, and the execution scan bundle assembler.
Key classes: LogicalType, LogicalKind, LogicalTypeProtoAdapter, LogicalComparators,
LogicalCoercions, ValueEncoders, and MinMaxCodec.
Canonical Type Kinds
LogicalKind defines the complete set of canonical types shared across all table formats (Iceberg,
Delta, etc.) and SQL-facing components.
| Canonical Kind | Proto field number | Description |
|---|---|---|
| BOOLEAN | 1 | Boolean true/false |
| INT | 2 | 64-bit signed integer (all source sizes collapse) |
| FLOAT | 4 | 32-bit IEEE-754 single-precision float |
| DOUBLE | 5 | 64-bit IEEE-754 double-precision float |
| DATE | 6 | Calendar date (no time, no timezone) |
| TIME | 7 | Time of day (no date, no timezone) |
| TIMESTAMP | 8 | Timezone-naive timestamp (local time) |
| TIMESTAMPTZ | 13 | UTC-normalised timestamp |
| STRING | 9 | UTF-8 text |
| BINARY | 10 | Arbitrary byte sequence |
| UUID | 11 | 128-bit universally unique identifier |
| DECIMAL | 12 | Fixed-precision decimal (precision, scale) |
| INTERVAL | 14 | Duration / period |
| JSON | 15 | Semi-structured JSON text |
| ARRAY | 16 | Ordered collection (non-parameterised in v1) |
| MAP | 17 | Key-value map (non-parameterised in v1) |
| STRUCT | 18 | Named-field record (non-parameterised in v1) |
| VARIANT | 19 | Schema-flexible semi-structured value |
Integer collapsing
Every source-format integer size (TINYINT, SMALLINT, INT, INTEGER, BIGINT, LONG, INT8, INT4, INT2,
UINT8, UINT4, UINT2) collapses to canonical INT (64-bit signed). Source alias names can be
resolved via LogicalKind.fromName(String).
Timestamp semantics
- TIMESTAMP – stores local time without UTC normalisation (Iceberg withoutZone(), Delta timestamp_ntz).
- TIMESTAMPTZ – stores microseconds-since-epoch UTC (Iceberg withZone(), Delta timestamp).
Note: The Floe spec decode matrix v1 has these two entries inverted. The implementation applies the semantically correct mapping and records the discrepancy in code comments.
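The mapping above can be sketched as a small lookup. This is a minimal illustration rather than the connector code: the class `TimestampMappingSketch` and its method names are hypothetical, and only the source-to-canonical pairs are taken from the text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; only the name pairs below come from the section above.
final class TimestampMappingSketch {
    // Delta primitive type names mapped to canonical kind names.
    private static final Map<String, String> DELTA = new HashMap<>();
    static {
        DELTA.put("timestamp", "TIMESTAMPTZ");    // UTC-stored
        DELTA.put("timestamp_ntz", "TIMESTAMP");  // timezone-naive
    }

    static String fromDelta(String deltaTypeName) {
        String canonical = DELTA.get(deltaTypeName);
        if (canonical == null) {
            throw new IllegalArgumentException("Not a Delta timestamp type: " + deltaTypeName);
        }
        return canonical;
    }

    // Iceberg distinguishes the two variants by a zone flag (withZone() / withoutZone()).
    static String fromIceberg(boolean withZone) {
        return withZone ? "TIMESTAMPTZ" : "TIMESTAMP";
    }

    public static void main(String[] args) {
        System.out.println(fromDelta("timestamp"));      // TIMESTAMPTZ
        System.out.println(fromDelta("timestamp_ntz"));  // TIMESTAMP
        System.out.println(fromIceberg(false));          // TIMESTAMP
    }
}
```

Note that the bare name "timestamp" lands on different canonical kinds depending on the source format, which is why the canonical string, not the source name, is what gets stored.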
Interval ranges
INTERVAL remains a single logical kind with an optional range:
- INTERVAL YEAR TO MONTH → range YEAR_TO_MONTH
- INTERVAL DAY TO SECOND → range DAY_TO_SECOND
- Plain INTERVAL → range UNSPECIFIED
Stats encoding (when present) uses ISO‑8601 duration strings. Engine‑native interval layouts are carried via overlays/hints, not Floecat core types.
Interval precisions follow ANSI SQL conventions:
- INTERVAL YEAR(p) TO MONTH → interval_leading_precision = p
- INTERVAL DAY(p) TO SECOND(s) → interval_leading_precision = p,
interval_fractional_precision = s
- INTERVAL(s) → normalized to DAY_TO_SECOND with interval_fractional_precision = s
Leading precision is non‑negative and connector‑defined; fractional precision is limited to 0..6 (microsecond scale) in Floecat encoders.
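As a rough illustration of ISO-8601 interval encoding with a bounded fractional precision, the sketch below uses `java.time.Period` and `java.time.Duration`. The class `IntervalEncodingSketch` and its methods are hypothetical; Floecat's actual encoder may format values differently (for example, it may emit days where `Duration.toString()` emits only hours).

```java
import java.time.Duration;
import java.time.Period;

// Hypothetical sketch of ISO-8601 interval encoding; not Floecat's encoder.
final class IntervalEncodingSketch {
    // YEAR TO MONTH: a month count rendered as a normalized ISO-8601 period.
    static String encodeYearToMonth(int totalMonths) {
        return Period.ofMonths(totalMonths).normalized().toString();
    }

    // DAY TO SECOND: truncate the fraction to `fractionalPrecision` digits (0..6), then render.
    // Written for non-negative durations.
    static String encodeDayToSecond(Duration value, int fractionalPrecision) {
        if (fractionalPrecision < 0 || fractionalPrecision > 6) {
            throw new IllegalArgumentException("fractional precision must be in 0..6");
        }
        long unit = 1;
        for (int i = fractionalPrecision; i < 9; i++) {
            unit *= 10; // unit ends up as 10^(9 - precision) nanoseconds
        }
        long nanos = value.getNano();
        return value.withNanos((int) (nanos - nanos % unit)).toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeYearToMonth(14));                                   // P1Y2M
        System.out.println(encodeDayToSecond(Duration.ofSeconds(90061, 123_456_789), 3)); // PT25H1M1.123S
    }
}
```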
Precision + parsing
- Canonical TIME, TIMESTAMP, and TIMESTAMPTZ are microsecond precision. Inputs with higher precision are truncated to micros for stats encoding/comparison.
- Numeric encodings are only accepted for DATE (epoch days). TIME, TIMESTAMP, and TIMESTAMPTZ require typed values or ISO-8601 strings; numeric heuristics are not used.
- Temporal precisions can be carried in the logical type string (e.g. TIME(3), TIMESTAMP(6), TIMESTAMPTZ(0), range 0..6). When present, encoders truncate and emit exactly that many fractional digits. When absent, Floecat defaults to microsecond precision and ISO-8601 formatting (no fixed width).
- TIMESTAMP expects timezone-naive inputs (no Z or offset). By default, zoned strings are rejected. You can opt into conversion by setting:
  - floecat.timestamp_no_tz.policy=CONVERT_TO_SESSION_ZONE (or env FLOECAT_TIMESTAMP_NO_TZ_POLICY)
  - floecat.session.timezone=<IANA zone> (or env FLOECAT_SESSION_TIMEZONE)

  When enabled, zoned timestamps are converted into that session zone and stored as local TIMESTAMP values.
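The two behaviours described above (fixed-width truncation to a declared precision, and conversion of a zoned input into the session zone) can be sketched with `java.time`. The class `TemporalPrecisionSketch` and its method names are hypothetical and only illustrate the rules; they are not Floecat's encoder.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch of the precision and session-zone rules; not Floecat's encoder.
final class TemporalPrecisionSketch {
    private static final DateTimeFormatter SECONDS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss");

    // Truncates (never rounds) to `precision` fractional digits and emits exactly that many.
    static String formatTimestamp(LocalDateTime value, int precision) {
        if (precision < 0 || precision > 6) {
            throw new IllegalArgumentException("precision must be in 0..6");
        }
        String base = SECONDS.format(value);
        if (precision == 0) {
            return base;
        }
        long divisor = 1;
        for (int i = precision; i < 9; i++) {
            divisor *= 10; // 10^(9 - precision)
        }
        long fraction = value.getNano() / divisor;
        return base + "." + String.format(Locale.ROOT, "%0" + precision + "d", fraction);
    }

    // CONVERT_TO_SESSION_ZONE: shift a zoned input into the session zone, then drop the zone.
    static LocalDateTime toSessionLocal(String zonedIso, ZoneId sessionZone) {
        return OffsetDateTime.parse(zonedIso).atZoneSameInstant(sessionZone).toLocalDateTime();
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2024, 1, 15, 10, 30, 0, 123_456_789);
        System.out.println(formatTimestamp(ts, 3)); // 2024-01-15T10:30:00.123
        System.out.println(toSessionLocal("2024-01-15T10:30:00Z", ZoneId.of("Europe/Berlin"))); // 2024-01-15T11:30
    }
}
```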
Complex types (v1)
ARRAY, MAP, STRUCT, and VARIANT are non-parameterised in v1. The logical kind captures
only the container category; element/value/field types are captured by child SchemaColumn rows
carrying their own paths (e.g. address.city, items[], tags{}).
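To make the child-row idea concrete, here is a minimal sketch that flattens a nested field into path rows. Everything in it is an assumption for illustration: the `ColumnPathSketch` and `Field` classes are hypothetical, and the rule that struct children use `.name`, array elements use `[]`, and map values use `{}` is inferred from the three example paths above rather than taken from Floecat's schema mappers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the path rules are inferred from the example paths in the text.
final class ColumnPathSketch {
    static final class Field {
        final String name;
        final String kind;
        final List<Field> children;

        Field(String name, String kind, Field... children) {
            this.name = name;
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Returns one "path KIND" row per column, parent before children.
    static List<String> flatten(Field column) {
        List<String> rows = new ArrayList<>();
        walk(column.name, column, rows);
        return rows;
    }

    private static void walk(String path, Field field, List<String> rows) {
        rows.add(path + " " + field.kind);
        for (Field child : field.children) {
            String childPath;
            if (field.kind.equals("ARRAY")) {
                childPath = path + "[]";            // assumed element suffix
            } else if (field.kind.equals("MAP")) {
                childPath = path + "{}";            // assumed value suffix
            } else {
                childPath = path + "." + child.name; // struct member
            }
            walk(childPath, child, rows);
        }
    }

    public static void main(String[] args) {
        Field address = new Field("address", "STRUCT", new Field("city", "STRING"));
        System.out.println(flatten(address)); // [address STRUCT, address.city STRING]
    }
}
```

The parent row carries only the container kind, which matches the non-parameterised v1 model: the element or field type lives on the child row.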
Architecture & Responsibilities
- LogicalType / LogicalKind – Immutable representations of logical types. LogicalType stores (kind, precision, scale, temporalPrecision, intervalRange, intervalLeadingPrecision, intervalFractionalPrecision). temporalPrecision is optional (unset means default microsecond precision). Interval fields are optional and only apply to INTERVAL. Canonical DECIMAL semantics are precision ≥ 1 and 0 ≤ scale ≤ precision with no global precision ceiling in the core model. Connector-specific constraints apply (for example Iceberg/Delta cap precision at 38, while other sources may allow larger values). TIME/TIMESTAMP/TIMESTAMPTZ may carry a fractional-second precision (0..6). All other kinds reject parameters.
- LogicalTypeProtoAdapter – Converts between the protobuf ai.floedb.floecat.types.LogicalType wire message and the JVM LogicalType, preserving kind/precision/scale/interval range metadata.
- LogicalCoercions – Coerces raw stat values to the canonical Java type for a given kind (e.g. any Number → Long for INT, string → LocalDateTime for TIMESTAMP (timezone-naive policy), string → Instant for TIMESTAMPTZ).
- LogicalComparators – Provides Comparator instances for ordering values encoded as strings or byte buffers (used when building column stats).
- ValueEncoders / MinMaxCodec – Encode scalar values into canonical strings/bytes, enabling deterministic min/max statistics across connectors.
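The coercion rules listed for LogicalCoercions can be illustrated with a small dispatch on the kind name. This is a sketch under stated assumptions: `CoercionSketch.coerce` is a hypothetical helper, it covers only the three kinds named in the text, and the real class may accept more input shapes.

```java
import java.time.Instant;
import java.time.LocalDateTime;

// Hypothetical sketch of the three coercions named in the text; not LogicalCoercions itself.
final class CoercionSketch {
    static Object coerce(String kind, Object raw) {
        switch (kind) {
            case "INT":
                // Any Number collapses to a 64-bit Long.
                if (raw instanceof Number) {
                    return ((Number) raw).longValue();
                }
                throw new IllegalArgumentException("INT expects a Number, got: " + raw);
            case "TIMESTAMP":
                if (raw instanceof LocalDateTime) {
                    return raw;
                }
                // LocalDateTime.parse rejects a trailing Z or offset, matching the default policy.
                return LocalDateTime.parse(raw.toString());
            case "TIMESTAMPTZ":
                if (raw instanceof Instant) {
                    return raw;
                }
                return Instant.parse(raw.toString());
            default:
                throw new IllegalArgumentException("kind not covered by this sketch: " + kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(coerce("INT", (short) 7));                       // 7
        System.out.println(coerce("TIMESTAMPTZ", "2024-01-15T10:30:00Z"));  // 2024-01-15T10:30:00Z
    }
}
```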
Type-Family Helpers
LogicalType exposes four predicate helpers for grouping kinds:
// Scalar numeric: INT, FLOAT, DOUBLE, DECIMAL
boolean isNumeric()
// Temporal: DATE, TIME, TIMESTAMP, TIMESTAMPTZ, INTERVAL
boolean isTemporal()
// Container: ARRAY, MAP, STRUCT, VARIANT
boolean isComplex()
// Everything that is not a container (alias for !isComplex())
boolean isScalar()
Example:
LogicalType t = LogicalType.decimal(10, 2);
t.isNumeric(); // true
t.isDecimal(); // true
t.isTemporal(); // false
t.isComplex(); // false
t.isScalar(); // true
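One plausible way to implement such family predicates is with `EnumSet` membership checks. The sketch below is an assumption about structure, not the actual LogicalType source: the `TypeFamilySketch` class and its nested `Kind` enum are hypothetical, and only the family memberships come from the comments above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; family memberships follow the helper comments in the text.
final class TypeFamilySketch {
    enum Kind {
        BOOLEAN, INT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, TIMESTAMPTZ, STRING,
        BINARY, UUID, DECIMAL, INTERVAL, JSON, ARRAY, MAP, STRUCT, VARIANT
    }

    private static final Set<Kind> NUMERIC =
        EnumSet.of(Kind.INT, Kind.FLOAT, Kind.DOUBLE, Kind.DECIMAL);
    private static final Set<Kind> TEMPORAL =
        EnumSet.of(Kind.DATE, Kind.TIME, Kind.TIMESTAMP, Kind.TIMESTAMPTZ, Kind.INTERVAL);
    private static final Set<Kind> COMPLEX =
        EnumSet.of(Kind.ARRAY, Kind.MAP, Kind.STRUCT, Kind.VARIANT);

    static boolean isNumeric(Kind kind) { return NUMERIC.contains(kind); }
    static boolean isTemporal(Kind kind) { return TEMPORAL.contains(kind); }
    static boolean isComplex(Kind kind) { return COMPLEX.contains(kind); }
    // Scalar is defined as the complement of complex.
    static boolean isScalar(Kind kind) { return !isComplex(kind); }

    public static void main(String[] args) {
        System.out.println(isNumeric(Kind.DECIMAL)); // true
        System.out.println(isScalar(Kind.JSON));     // true
        System.out.println(isComplex(Kind.VARIANT)); // true
    }
}
```

Defining isScalar as the complement means kinds such as STRING, JSON, and UUID are scalar without belonging to the numeric or temporal families.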
Public API / Surface Area
Most classes expose static helpers:
LogicalType t = LogicalType.decimal(38, 4);
String logicalType = LogicalTypeFormat.format(t);
String encoded = MinMaxCodec.encode(t, BigDecimal.valueOf(42));
LogicalTypeProtoAdapter.decodeLogicalType(String logicalType) and
LogicalTypeProtoAdapter.encodeLogicalType(LogicalType logicalType) convert between canonical
logical type strings and runtime objects.
Arrow Mapping Contract
core/arrow helpers (especially ArrowSchemaUtil) are defined over Floecat logical types, not over
arbitrary engine-native type systems.
- Input is SchemaColumn.logical_type and should be a Floecat canonical logical type string (or an accepted alias handled by LogicalKind.fromName semantics).
- Integer aliases (TINYINT, SMALLINT, INT, BIGINT, INT2/4/8, UINT2/4/8) all map to Arrow signed 64-bit (Int64) to preserve collapsed canonical INT behavior.
- Unknown, null, or blank logical types fail fast with IllegalArgumentException; they are not silently coerced to Utf8.
- JSON maps to Arrow Utf8.
- UUID maps to Arrow FixedSizeBinary(16); BINARY maps to Arrow Binary.
- DECIMAL maps to Arrow Decimal128 when precision ≤ 38, and to Decimal256 when 38 < precision ≤ 76. Precision > 76 is rejected by ArrowSchemaUtil.
- TIME maps to Arrow Time(MICROSECOND, 64), TIMESTAMP to Timestamp(MICROSECOND, null), and TIMESTAMPTZ to Timestamp(MICROSECOND, "UTC").
- INTERVAL and complex container types (ARRAY, MAP, STRUCT, VARIANT) are not supported in Arrow schema generation; they must be omitted or cast to STRING/BINARY.
If external Flight providers want to reuse core/arrow, they should first map their source type
surface into Floecat logical types.
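The contract above can be summarised as a lookup from canonical kind to Arrow type. To stay dependency-free, the sketch below returns Arrow type descriptions as plain strings instead of Arrow classes; `ArrowMappingSketch.arrowTypeFor` is a hypothetical helper, and kinds whose mapping is not stated in this section are deliberately left out rather than guessed.

```java
// Hypothetical sketch of the mapping rules listed above; not ArrowSchemaUtil.
final class ArrowMappingSketch {
    // `decimalPrecision` is only consulted for DECIMAL.
    static String arrowTypeFor(String kind, int decimalPrecision) {
        if (kind == null || kind.trim().isEmpty()) {
            throw new IllegalArgumentException("logical type must not be null or blank");
        }
        switch (kind) {
            case "INT":         return "Int64";
            case "JSON":        return "Utf8";
            case "UUID":        return "FixedSizeBinary(16)";
            case "BINARY":      return "Binary";
            case "TIME":        return "Time(MICROSECOND, 64)";
            case "TIMESTAMP":   return "Timestamp(MICROSECOND, null)";
            case "TIMESTAMPTZ": return "Timestamp(MICROSECOND, \"UTC\")";
            case "DECIMAL":
                if (decimalPrecision < 1) {
                    throw new IllegalArgumentException("DECIMAL precision must be >= 1");
                }
                if (decimalPrecision <= 38) return "Decimal128";
                if (decimalPrecision <= 76) return "Decimal256";
                throw new IllegalArgumentException("DECIMAL precision > 76 is rejected");
            case "INTERVAL":
            case "ARRAY":
            case "MAP":
            case "STRUCT":
            case "VARIANT":
                throw new IllegalArgumentException(kind + " is not supported in Arrow schemas");
            default:
                // Fail fast instead of silently falling back to Utf8.
                throw new IllegalArgumentException("unknown or unlisted logical type: " + kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(arrowTypeFor("DECIMAL", 38)); // Decimal128
        System.out.println(arrowTypeFor("DECIMAL", 39)); // Decimal256
    }
}
```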
Source-Format Alias Lookup
LogicalKind.fromName(String candidate) resolves source-format type names to canonical kinds:
LogicalKind.fromName("bigint") // → INT
LogicalKind.fromName("float4") // → FLOAT
LogicalKind.fromName("double precision") // → DOUBLE
LogicalKind.fromName("timestamp with time zone") // → TIMESTAMPTZ
LogicalKind.fromName("ARRAY") // → ARRAY
The lookup is case-insensitive and collapses internal whitespace. Unknown names throw
IllegalArgumentException.
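A lookup with these properties might normalise the candidate before consulting an alias map. The sketch below is hypothetical throughout: `AliasLookupSketch` is not the real LogicalKind, its alias table contains only names mentioned in this document, and it assumes that "collapses internal whitespace" means reducing each run of whitespace to a single space.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a case-insensitive, whitespace-collapsing alias lookup.
final class AliasLookupSketch {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        String[] integerAliases = {
            "TINYINT", "SMALLINT", "INT", "INTEGER", "BIGINT", "LONG",
            "INT8", "INT4", "INT2", "UINT8", "UINT4", "UINT2"
        };
        for (String alias : integerAliases) {
            ALIASES.put(alias, "INT"); // every integer size collapses to canonical INT
        }
        ALIASES.put("FLOAT4", "FLOAT");
        ALIASES.put("DOUBLE PRECISION", "DOUBLE");
        ALIASES.put("TIMESTAMP WITH TIME ZONE", "TIMESTAMPTZ");
        ALIASES.put("ARRAY", "ARRAY");
    }

    static String fromName(String candidate) {
        if (candidate == null) {
            throw new IllegalArgumentException("type name must not be null");
        }
        // Assumed normalisation: trim, collapse whitespace runs, upper-case.
        String key = candidate.trim().replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
        String kind = ALIASES.get(key);
        if (kind == null) {
            throw new IllegalArgumentException("Unknown logical type name: " + candidate);
        }
        return kind;
    }

    public static void main(String[] args) {
        System.out.println(fromName("bigint"));               // INT
        System.out.println(fromName("Double    Precision"));  // DOUBLE
    }
}
```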
Important Internal Details
- Validation – The LogicalType constructor enforces: for DECIMAL, precision ≥ 1 and 0 ≤ scale ≤ precision. There is no global DECIMAL precision cap in the core model; connectors enforce their own ceilings (for example Iceberg/Delta cap at 38) at schema-parse time. Non-decimal kinds reject precision/scale altogether.
- Non-stats-orderable types – INTERVAL, JSON, and complex kinds (ARRAY, MAP, STRUCT, VARIANT) have no meaningful min/max statistics. LogicalComparators.normalize() returns null and ValueEncoders.encodeToString throws for JSON/complex kinds, so connectors should leave bounds unset. INTERVAL encodings can be stored but are ignored by stats comparisons.
- Comparators – LogicalComparators provides specialised comparators for lexical ordering of encoded min/max values so histogram builders can operate on encoded strings.
- Encoders – ValueEncoders normalises values before storing them in stats to guarantee consistent lexical ordering across connectors.
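The split between the core DECIMAL rule and connector-specific ceilings can be sketched as two checks. The class and method names (`DecimalValidationSketch`, `checkCore`, `checkConnector`) are hypothetical; only the inequalities and the example ceiling of 38 come from the text.

```java
// Hypothetical sketch of core vs. connector DECIMAL validation; not the LogicalType constructor.
final class DecimalValidationSketch {
    // Core-model rule: precision >= 1 and 0 <= scale <= precision, with no upper bound.
    static void checkCore(int precision, int scale) {
        if (precision < 1) {
            throw new IllegalArgumentException("DECIMAL precision must be >= 1, got " + precision);
        }
        if (scale < 0 || scale > precision) {
            throw new IllegalArgumentException(
                "DECIMAL scale must be in 0.." + precision + ", got " + scale);
        }
    }

    // Connector rule layered on top, e.g. maxPrecision = 38 for Iceberg/Delta.
    static void checkConnector(int precision, int scale, int maxPrecision) {
        checkCore(precision, scale);
        if (precision > maxPrecision) {
            throw new IllegalArgumentException(
                "DECIMAL precision " + precision + " exceeds connector ceiling " + maxPrecision);
        }
    }

    public static void main(String[] args) {
        checkCore(50, 10); // accepted by the core model
        try {
            checkConnector(50, 10, 38);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected by connector: " + e.getMessage());
        }
    }
}
```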
Data Flow & Lifecycle
Connector reads Parquet/Delta/Iceberg schema
→ schema mapper emits canonical LogicalKind strings (e.g. "INT", "TIMESTAMPTZ", "ARRAY")
→ SchemaColumn.logical_type stores the canonical string
→ ValueEncoders encode per-column min/max/ndv bounds
→ StatsRepository stores encoded values (string or bytes)
→ Query lifecycle service converts stored logical type IDs into planner TypeSpecs via TypeRegistry
Configuration & Extensibility
The module is pure Java and requires no configuration; the TIMESTAMP zone-handling properties described under Precision + parsing are optional. Extending the type system involves:
1. Adding a new LogicalKind enum entry and proto Kind field number.
2. Registering source-format aliases in the ALIASES map inside LogicalKind.
3. Updating LogicalCoercions, LogicalComparators, and ValueEncoders for the new kind.
4. Updating connector schema mappers (IcebergSchemaMapper, DeltaSchemaMapper) to emit the new
canonical name.
5. Ensuring downstream consumers (FloeTypeMapper, DeltaManifestMaterializer) handle the new
canonical name.
Examples & Scenarios
- Iceberg schema parsing – IcebergSchemaMapper.toCanonical(Type) converts Iceberg types to canonical strings (e.g. TimestampType.withZone() → "TIMESTAMPTZ"), storing them in SchemaColumn.logical_type. This avoids the historic ambiguity where "timestamp" meant UTC-stored in Delta but non-UTC in Iceberg. Iceberg TimestampNanoType is mapped with the same timezone semantics (withZone() → "TIMESTAMPTZ", withoutZone() → "TIMESTAMP"), and Iceberg VariantType maps to canonical "VARIANT".
- Delta schema parsing – DeltaSchemaMapper.deltaTypeToCanonical(JsonNode) applies Delta-specific semantics: "timestamp" → "TIMESTAMPTZ" (UTC-stored), "timestamp_ntz" → "TIMESTAMP" (timezone-naive).
- Statistics ingestion – NDV providers convert Parquet min/max values using MinMaxCodec before storing them in ScalarStats, ensuring planners can compare them without deserialising actual binary payloads. Connector planners canonicalize connector-native numeric temporal bounds to typed values before generic coercion (for example Iceberg TIME micros-of-day, TIMESTAMP micros, and TIMESTAMP_NANO nanos).
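Canonicalising connector-native numeric bounds to typed values, as described in the statistics ingestion scenario, could look like the following. This is a sketch with hypothetical names (`TemporalBoundsSketch` and its methods); it relies only on `java.time` and uses floor division so that pre-epoch values convert correctly.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Hypothetical sketch of numeric-to-typed conversion for temporal bounds.
final class TemporalBoundsSketch {
    // TIME stored as microseconds since midnight.
    static LocalTime timeFromMicrosOfDay(long microsOfDay) {
        return LocalTime.ofNanoOfDay(Math.multiplyExact(microsOfDay, 1_000L));
    }

    // TIMESTAMP stored as microseconds since the epoch (timezone-naive).
    static LocalDateTime timestampFromMicros(long epochMicros) {
        long seconds = Math.floorDiv(epochMicros, 1_000_000L);
        int nanos = (int) (Math.floorMod(epochMicros, 1_000_000L) * 1_000L);
        return LocalDateTime.ofEpochSecond(seconds, nanos, ZoneOffset.UTC);
    }

    // TIMESTAMP_NANO stored as nanoseconds since the epoch, reduced to canonical micros.
    static LocalDateTime timestampFromNanos(long epochNanos) {
        long seconds = Math.floorDiv(epochNanos, 1_000_000_000L);
        long nanoOfSecond = Math.floorMod(epochNanos, 1_000_000_000L);
        int microAligned = (int) (nanoOfSecond - nanoOfSecond % 1_000L);
        return LocalDateTime.ofEpochSecond(seconds, microAligned, ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(timeFromMicrosOfDay(3_600_000_000L)); // 01:00
        System.out.println(timestampFromMicros(-1L));            // 1969-12-31T23:59:59.999999
    }
}
```

Converting to typed values first keeps the generic coercion step free of numeric heuristics, since the same long could otherwise be read as seconds, millis, micros, or nanos.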
Cross-References
- Protobuf type definitions: docs/proto.md
- Query lifecycle service: docs/service.md