Delta / Unity Catalog Connector¶
Overview¶
connectors/catalogs/delta/ implements a connector targeting Databricks Unity Catalog-powered Delta Lake
warehouses. It uses the Delta Kernel, Unity Catalog REST APIs, Databricks SQL endpoints, and AWS S3
(through the v2 client) to enumerate tables, collect statistics, and plan files.
The primary implementation is DeltaConnector (abstract) with source-specific subclasses for Unity
Catalog, AWS Glue, and filesystem-backed tables, exposed via DeltaConnectorProvider. Supporting classes manage
OAuth2 bearer token usage (including CLI, service principal, and WIF flows resolved upstream),
Databricks SQL execution, and custom file readers for S3.
Architecture & Responsibilities¶
- `DeltaConnector` – Abstract `FloecatConnector` that centralizes snapshot/stat logic.
- `UnityDeltaConnector` – Unity Catalog-backed connector that:
    - Talks to Unity Catalog REST (`UcHttp`) to list catalogs/schemas/tables.
    - Uses Delta Kernel (`io.delta.kernel.Table`) for schema and snapshot access.
    - Executes Databricks SQL statements via `SqlStmtClient` if a warehouse is configured.
    - Reads Parquet data with `S3V2FileSystemClient` and `ParquetS3V2InputFile` for NDV/statistics.
    - Plans files using `DeltaPlanner`, emitting `ScanFile`s for data/delete manifests.
- `DeltaFilesystemConnector` – Single-table connector for `delta.table-root` plus optional `external.namespace`/`external.table-name` overrides.
- `DeltaGlueConnector` – AWS Glue-backed connector that:
    - Lists databases and Delta-registered tables from Glue.
    - Resolves the table `storage_location` from Glue metadata and reads table snapshots via Delta Kernel.
- `DeltaConnectorFactory` – Selects Unity, Glue, or filesystem sources and wires engine/auth/IO.
- `UcBaseSupport`/`UcHttp` – HTTP helpers for constructing API URLs, encoding parameters, and handling retries/timeouts.
- `DeltaTypeMapper` – Maps Delta/Parquet logical types into Floecat logical types for stats.
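The factory's backend selection can be sketched as follows. This is an illustrative Python sketch, not the Java implementation; the validation shown (rejecting unknown sources, requiring `delta.table-root` for the filesystem backend) is an assumption consistent with the configuration rules described later in this document.

```python
def select_source(properties):
    """Pick the Delta backend the way this document describes:
    `delta.source` chooses unity, glue, or filesystem (default unity).
    Hypothetical helper; validation details are assumptions."""
    source = properties.get("delta.source", "unity")
    if source not in ("unity", "glue", "filesystem"):
        raise ValueError(f"unknown delta.source: {source}")
    # The filesystem backend is single-table and needs a table root.
    if source == "filesystem" and "delta.table-root" not in properties:
        raise ValueError("delta.source=filesystem requires delta.table-root")
    return source
```

For example, `select_source({})` yields `"unity"`, matching the documented default.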
Public API / Surface Area¶
DeltaConnector and subclasses implement the SPI methods:
- `listNamespaces()` – Fetches catalogs via `/api/2.1/unity-catalog/catalogs`, then enumerates schemas per catalog, returning `catalog.schema` pairs.
- `listTables(namespace)` – Calls `/api/2.1/unity-catalog/tables` filtered by catalog/schema, then filters to `data_source_format == DELTA`.
- `describe(namespace, table)` – Fetches table metadata from Unity Catalog, reads the Delta schema via Delta Kernel, and returns a `TableDescriptor` containing location, partition keys, and properties.
- `enumerateSnapshots(...)` – Iterates Delta snapshots and emits `SnapshotBundle`s for snapshot lineage/metadata. In incremental mode, the connector enumerates all Delta versions that Floecat has not already ingested. When `SnapshotEnumerationOptions.targetSnapshotIds` is supplied, enumeration is limited to that explicit version set even when `fullRescan=true`.
- `captureSnapshotTargetStats(...)` – Captures table/column/file stats for one snapshot and optional selector scope, optionally sampling Parquet files for NDV (`SamplingNdvProvider`, `ParquetNdvProvider`).
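The `listTables` filtering step can be illustrated with a minimal sketch over a Unity Catalog `/tables` response. The field names (`catalog_name`, `schema_name`, `name`, `data_source_format`) follow the UC REST payload; the helper itself is hypothetical and written in Python for brevity:

```python
def delta_tables(uc_tables):
    """Keep only Delta tables from a Unity Catalog /tables listing,
    mirroring the data_source_format == DELTA filter, and return
    fully qualified catalog.schema.table names."""
    return [
        "{0}.{1}.{2}".format(t["catalog_name"], t["schema_name"], t["name"])
        for t in uc_tables
        if t.get("data_source_format") == "DELTA"
    ]
```

Tables registered with other formats (for example Parquet or CSV externals) are silently dropped, which is why a Unity schema can show more tables than the connector lists.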
Important Internal Details¶
- Authentication – Uses an OAuth2 bearer token supplied in the resolved connector config or the Databricks CLI cache. Token exchange and secret handling happen earlier in the service layer, except for CLI cache refresh, which is handled in the connector.
- HTTP & SQL clients – `UcHttp` centralises base URI, connect/read timeouts, and error mapping. `SqlStmtClient` optionally executes SQL statements (for example to inspect statistics tables) via Databricks SQL warehouses.
- S3 integration – Uses AWS SDK v2 (`S3Client`) with the region from connector properties to read data files. `S3RangeReader` provides efficient range reads for Parquet file access.
- NDV sampling – Controlled by `stats.ndv.enabled`, `stats.ndv.sample_fraction`, and `stats.ndv.max_files`. Samples combine streaming NDV with Parquet footers for accuracy.
- Type mapping – `DeltaTypeMapper` ensures nested Delta/Parquet types are faithfully represented when computing stats, aligning with `types/definitions`.
- Constraint mapping – Snapshot constraints currently emit only metadata that is reliably exposed by Delta snapshots/table metadata:
    - `CT_NOT_NULL` from non-nullable schema fields (including nested struct leaves).
    - `CT_CHECK` from table properties of the form `delta.constraints.<name>=<sql_expression>`.
    - `CT_PRIMARY_KEY`, `CT_FOREIGN_KEY`, and `CT_UNIQUE` are not emitted from core Delta metadata because no portable source is defined for them.
    - Source-specific extraction path:
        - Unity Catalog: merge of snapshot metadata + UC table properties from `/api/2.1/unity-catalog/tables/{full_name}`. Snapshot metadata wins on key collisions.
        - Glue: merge of snapshot metadata + Glue table parameters. Snapshot metadata wins on key collisions.
        - Filesystem: snapshot metadata only (no catalog-level fallback source).
    - Connector matrix (current behavior):
        - Unity: `CT_NOT_NULL`, `CT_CHECK` (`delta.constraints.*`) from merged snapshot + UC metadata.
        - Glue: `CT_NOT_NULL`, `CT_CHECK` (`delta.constraints.*`) from merged snapshot + Glue metadata.
        - Filesystem: `CT_NOT_NULL`, `CT_CHECK` (`delta.constraints.*`) from snapshot metadata only.
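The merge precedence and constraint derivation above can be sketched as follows. This is an illustrative Python sketch; the function names are hypothetical, but the rules they encode (snapshot metadata wins on collisions, `CT_NOT_NULL` from non-nullable fields, `CT_CHECK` from `delta.constraints.<name>` properties) are the ones documented here.

```python
def merge_props(snapshot_props, catalog_props):
    """Merge catalog-level properties under snapshot metadata.
    Snapshot metadata wins on key collisions, per the extraction path."""
    merged = dict(catalog_props)
    merged.update(snapshot_props)
    return merged

def extract_constraints(schema_fields, merged_props):
    """Derive the constraints the connector emits.
    schema_fields: iterable of (field_name, nullable) pairs."""
    constraints = []
    for name, nullable in schema_fields:
        if not nullable:
            constraints.append(("CT_NOT_NULL", name))
    prefix = "delta.constraints."
    for key, expr in merged_props.items():
        if key.startswith(prefix):
            constraints.append(("CT_CHECK", key[len(prefix):], expr))
    return constraints
```

Note that primary-key, foreign-key, and unique constraints never appear in the output, reflecting the connector matrix above.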
Data Flow & Lifecycle¶
```text
ConnectorFactory.create(cfg)
  → DeltaConnectorFactory.create(uri, options, authProvider)
      → Select Unity vs Glue vs filesystem source
      → Instantiate S3 client + Delta Kernel engine
      → Configure Unity Catalog HTTP and optional SQL client when needed
→ listNamespaces/listTables via Unity Catalog REST
→ describe via REST + Delta Kernel schema inspection
→ enumerateSnapshots
  → Delta Kernel snapshot lineage
→ captureSnapshotTargetStats
  → Delta Kernel Snapshot → Parquet stats engine → TargetStatsRecord (table/column/file stats)
→ plan
  → DeltaPlanner traverses _delta_log → ScanBundle entries for data/delete files
```
Connector resources (HTTP clients, S3 client, Delta engine) are closed when close() is invoked.
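The `_delta_log` traversal at the end of the flow can be illustrated with a minimal replay of add/remove actions. This is a deliberately simplified Python sketch: the real `DeltaPlanner` also handles checkpoints, deletion vectors, and delete manifests, none of which appear here.

```python
import json

def live_files(log_lines):
    """Minimal replay of _delta_log JSON actions: an 'add' action
    registers a data file, a 'remove' action retracts it. The net
    set is what a planner would emit as data-file scan entries."""
    files = set()
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files
```

This add/remove net effect is why a table's current snapshot can reference far fewer files than the log contains actions.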
Configuration & Extensibility¶
Important connector properties:
- `delta.source` – Selects backend (`unity`, `glue`, `filesystem`). Defaults to `unity`.
- `delta.table-root` – Required for `delta.source=filesystem`, pointing at a Delta table root.
- `external.namespace`, `external.table-name` – Optional overrides for filesystem connector naming.
- `http.connect.ms`, `http.read.ms` – Timeout controls for Unity Catalog HTTP calls.
- `databricks.sql.warehouse_id` – Enables SQL statement execution when set.
- `s3.region`/`aws.region` – Region for the S3 client used to read Parquet files.
- `stats.ndv.*` – Sampling knobs identical to the Iceberg connector.
- Authentication-specific options (`auth.scheme`, `auth.properties`) – `auth.scheme=oauth2` expects either `token=<access-token>` or `oauth.mode=cli` (to read the Databricks CLI cache). Service principal and WIF are expressed as `AuthCredentials` and resolved upstream. For `delta.source=glue`, use AWS credentials/profile options (for example resolved `s3.*` keys or `auth.properties` profile settings) and set `auth.scheme=aws-sigv4` or `none`.
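The interplay of the `stats.ndv.*` knobs might look like the following. This is an illustrative Python sketch only: the default values and the take-the-first-N selection order are assumptions, not the connector's documented policy.

```python
import math

def choose_ndv_sample(data_files, props):
    """Pick which Parquet files to sample for NDV, honoring the
    stats.ndv.* knobs: enabled gate, sample fraction, file cap.
    Defaults and ordering here are illustrative assumptions."""
    if props.get("stats.ndv.enabled", "false").lower() != "true":
        return []
    fraction = float(props.get("stats.ndv.sample_fraction", "1.0"))
    max_files = int(props.get("stats.ndv.max_files", str(len(data_files))))
    count = min(max_files, math.ceil(len(data_files) * fraction))
    return data_files[:count]
```

With four files and `stats.ndv.sample_fraction=0.5`, two files would be sampled; `stats.ndv.max_files` then caps that count.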
Auth credential types (`--cred-type`) are documented in `docs/cli-reference.md`.
For Delta, the relevant types are `bearer`, `client` (SP), `cli`, `token-exchange` (WIF),
`token-exchange-entra`, and `token-exchange-gcp`. Entra/GCP exchanges work only if the Databricks
workspace is configured to trust those IdPs.
Use the Databricks workspace host for `uri` (for example `https://dbc-<workspace-id>.cloud.databricks.com`);
token exchange endpoints use `https://<workspace-host>/oidc/v1/token`.
Extensibility points:
- Implement new auth schemes by extending `AuthProvider` and wiring them in the connector provider.
- Plug in additional NDV providers if Delta tables store custom sketches.
- Extend `DeltaPlanner` to emit additional metadata (for example z-order hints) when the upstream API exposes them.
Examples & Scenarios¶
- Connector Spec – A Unity Catalog connector might specify:

  ```json
  {
    "display_name": "delta-unity",
    "kind": "CK_DELTA",
    "uri": "https://dbc-1234.cloud.databricks.com",
    "properties": {
      "databricks.sql.warehouse_id": "abcd",
      "s3.region": "us-west-2",
      "stats.ndv.enabled": "true"
    },
    "auth": {
      "scheme": "oauth2",
      "credentials": {"bearer": {"token": "<access-token>"}},
      "properties": {}
    }
  }
  ```
- CLI examples
  - Service principal (SP) – Use `client` credentials. Resolved via client-credentials exchange in the service layer; the connector sees a bearer token. The token endpoint is the workspace OIDC URL: `https://<workspace-host>/oidc/v1/token`.

    ```shell
    connector create "Unity Delta SP" DELTA https://dbc-d382c535-b2a9.cloud.databricks.com \
      "cusack.ext_tpcds" tpcds --dest-ns federated --source-table store_sales \
      --auth-scheme oauth2 \
      --cred-type client \
      --cred endpoint=https://dbc-d382c535-b2a9.cloud.databricks.com/oidc/v1/token \
      --cred client_id=3d9b2f0f-7f1a-4b6e-9f0a-2f1b6c9a1234 \
      --cred client_secret=ddbsp-9f1c2a3b4c5d6e7f8a9b \
      --auth scope=all-apis
    ```

  - WIF (token exchange) – Use `token-exchange`. Resolved via RFC 8693 exchange in the service layer; the connector sees a bearer token. The token endpoint is the workspace OIDC URL: `https://<workspace-host>/oidc/v1/token`.

    ```shell
    connector create "Unity Delta WIF" DELTA https://dbc-d382c535-b2a9.cloud.databricks.com \
      "cusack.ext_tpcds" tpcds --dest-ns federated --source-table store_sales \
      --auth-scheme oauth2 \
      --cred-type token-exchange \
      --cred endpoint=https://dbc-d382c535-b2a9.cloud.databricks.com/oidc/v1/token \
      --cred client_id=3d9b2f0f-7f1a-4b6e-9f0a-2f1b6c9a1234 \
      --cred client_secret=ddbsp-9f1c2a3b4c5d6e7f8a9b \
      --cred subject_token_type=urn:ietf:params:oauth:token-type:jwt \
      --cred requested_token_type=urn:ietf:params:oauth:token-type:access_token \
      --cred scope="all-apis offline_access"
    ```

  - CLI cache – The connector reads the Databricks CLI cache directly:

    ```shell
    connector create "Unity Delta CLI" DELTA https://dbc-d382c535-b2a9.cloud.databricks.com \
      "cusack.ext_tpcds" tpcds --dest-ns federated --source-table store_sales \
      --auth-scheme oauth2 \
      --cred-type cli \
      --cred cache_path=~/.databricks/token-cache.json
    ```

  - Bearer token (PAT) – Using the `connector` CLI with a resolved token or a Databricks personal access token:

    ```shell
    connector create "Unity Delta Token" DELTA https://dbc-d382c535-b2a9.cloud.databricks.com \
      "cusack.ext_tpcds" tpcds --dest-ns federated --source-table store_sales \
      --auth-scheme oauth2 --auth token=<access-token>
    ```
- Full reconciliation – `ReconcilerService` enters full-rescan mode (`fullRescan=true`), so the connector lists every table in the namespace, creates missing namespaces in the destination catalog, updates `DestinationTarget` pointers, and ingests snapshot stats for each table.
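For contrast with the Unity spec above, the `properties` block of a filesystem-backed connector might look like the following fragment. The paths and names are illustrative; the property keys are those documented under Configuration & Extensibility.

```json
"properties": {
  "delta.source": "filesystem",
  "delta.table-root": "s3://my-bucket/warehouse/store_sales",
  "external.namespace": "federated",
  "external.table-name": "store_sales",
  "s3.region": "us-west-2"
}
```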
Cross-References¶
- SPI details: `docs/connectors-spi.md`
- Iceberg connector for contrast: `docs/connectors-iceberg.md`
- Service & reconciler integration: `docs/service.md`, `docs/reconciler.md`