The Federated Governance Gap No Data Catalog Has Solved Until Now

June 24, 2026

10-minute read

Your team set up row-level filters in Spark last quarter. Analysts in the Americas can't see European customer rows. It's a basic GDPR safeguard, and the security team signed off.

Last week, the analytics group started running interactive queries against the same dataset through Trino. The filters don't apply. An Americas analyst can pull every European row through Trino without tripping any of the controls. The policy lives in the Spark catalog. Trino reads from a different catalog and has never heard of it.

This is what federated governance failure actually looks like. It is also what Apache Gravitino was designed to prevent.

What Federated Data Governance Is Actually Trying to Solve

Federated governance solves a structural problem that becomes visible the moment your data architecture spans more than one engine. The model distributes data ownership to the domain teams that produce and use the data, while keeping policy enforcement centralized and consistent. A finance team owns and operates its sales data; a security team defines and enforces who can access it. Both work simultaneously because the model separates the two concerns.

Multi-engine data governance becomes architecturally necessary in enterprise environments because centralized governance breaks down. Centralized models assume a single team controls both data and policy, which works in a single warehouse with a single engine. As enterprises add Spark for batch processing, Trino for interactive analytics, Flink for streaming, and AI workloads with their own data access patterns, the centralized team becomes a bottleneck. Data engineering teams wait on governance reviews. Governance teams cannot keep up with the variety of engine-specific access patterns. The single-team model collapses under its own weight.

The federated alternative needs something that neither distributed ownership nor centralized enforcement provides on its own: a binding layer that lets the two work together across engines. That binding layer is the catalog. Catalog metadata defines what data exists, where it lives, what its schema looks like, and how it relates to other datasets.

Policy bindings on that metadata define who can access it, under what conditions, with what masking or filtering applied, and through which engines. When the catalog travels with the data across engines, the policy bindings travel with it. When the catalog stays behind, federated governance becomes federated risk.
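The difference between a catalog that travels and one that stays behind can be sketched in a few lines of Python. This is a minimal in-memory model, not the API of Spark, Trino, or any real catalog: the `RowFilter`, `Catalog`, and `Engine` classes, the `customers` table, and the region codes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class RowFilter:
    """Hypothetical policy binding: decides whether a user may see a row."""
    description: str
    allows: Callable[[str, Row], bool]  # (user_region, row) -> visible?


@dataclass
class Catalog:
    """Hypothetical catalog mapping table names to their policy bindings."""
    name: str
    bindings: Dict[str, List[RowFilter]] = field(default_factory=dict)

    def bind(self, table: str, policy: RowFilter) -> None:
        self.bindings.setdefault(table, []).append(policy)


@dataclass
class Engine:
    """Hypothetical engine that enforces only what its own catalog knows."""
    name: str
    catalog: Catalog

    def scan(self, table: str, rows: List[Row], user_region: str) -> List[Row]:
        policies = self.catalog.bindings.get(table, [])
        return [r for r in rows if all(p.allows(user_region, r) for p in policies)]


CUSTOMERS = [
    {"id": "1", "region": "EU"},
    {"id": "2", "region": "AMER"},
    {"id": "3", "region": "EU"},
]

# EU rows are visible only to users located in the EU.
eu_residency = RowFilter(
    description="EU rows visible only to EU users",
    allows=lambda user_region, row: row["region"] != "EU" or user_region == "EU",
)

# Case 1: one catalog per engine, policy registered on the Spark side only.
spark_catalog = Catalog("spark")
trino_catalog = Catalog("trino")
spark_catalog.bind("customers", eu_residency)
spark_rows = Engine("spark", spark_catalog).scan("customers", CUSTOMERS, "AMER")
trino_rows = Engine("trino", trino_catalog).scan("customers", CUSTOMERS, "AMER")

# Case 2: both engines resolve bindings through the same shared catalog.
shared = Catalog("shared")
shared.bind("customers", eu_residency)
shared_spark_rows = Engine("spark", shared).scan("customers", CUSTOMERS, "AMER")
shared_trino_rows = Engine("trino", shared).scan("customers", CUSTOMERS, "AMER")

print(len(spark_rows), len(trino_rows))            # per-engine catalogs
print(len(shared_spark_rows), len(shared_trino_rows))  # shared catalog
```

In the first case the Americas analyst gets one row through the Spark-like engine and all three through the Trino-like engine. In the second case both engines return the same single row, because the binding is attached to the catalog entry and not to an engine.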

What an Apache Gravitino-Style Catalog Actually Requires

Most data catalog implementations positioned for federated governance break at engine boundaries. A federated data catalog must maintain metadata and policy continuity across every engine that touches the data, and most enterprise catalogs cannot.

The engine boundary problem shows up everywhere data moves between processing engines. Spark writes a dataset to S3 with row-level policy bindings registered in the Spark catalog. Trino queries the same dataset and sees none of those bindings, because it reads from a different catalog. Flink, ML platforms, federated query layers, and replication pipelines each maintain their own catalogs with their own metadata models. Each engine has its own view of what the data is and who can access it. Where the views diverge is exactly where compliance breaks.

The consequence is concrete: policy enforced in one engine becomes invisible in another. The Spark catalog knows about a column-level mask that hides social security numbers from analysts; the Trino catalog does not, so an analyst querying the same dataset through Trino sees the raw column. The gaps are invisible to the governance team because the metadata that would expose them lives in different places.

Multi-engine data flow	Where catalog metadata lives	Where governance policy applies	Where continuity breaks
Spark batch writes, Trino interactive queries	Spark catalog (Hive Metastore or proprietary)	Spark-side row and column policies	Trino reads from its own catalog; Spark policies invisible
Trino federated queries, Iceberg storage	Trino catalog plus Iceberg metadata	Trino access control only	Other engines reading Iceberg see no Trino policy bindings
Flink streaming, Spark batch reconciliation	Separate catalogs per engine	Engine-specific enforcement	Stream-batch joins lose lineage and policy continuity
Cross-cloud replication for high availability	Each region maintains its own catalog	Region-specific policy enforcement	Replicas in other regions inherit no policy bindings
AI training pipelines reading from the data lake	ML platform catalog	Training-side access control	Lake-side governance does not propagate into training
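One way to make the gaps in the table above visible is to compare what each engine's catalog has bound to the same table. The sketch below assumes you can export, per engine, a mapping of table names to policy names. That export step, the `find_policy_gaps` helper, and the policy names are hypothetical; real catalogs expose this information in different ways, if at all.

```python
from typing import Dict, Set, Tuple

# Hypothetical snapshot shape: engine -> table -> names of policies bound there.
EngineBindings = Dict[str, Dict[str, Set[str]]]


def find_policy_gaps(bindings: EngineBindings) -> Dict[Tuple[str, str], Set[str]]:
    """Return {(engine, table): policies bound elsewhere but missing here}.

    Only engines that actually register the table are checked, so an engine
    that never reads a table is not reported as a gap.
    """
    all_tables = {table for tables in bindings.values() for table in tables}
    gaps: Dict[Tuple[str, str], Set[str]] = {}
    for table in sorted(all_tables):
        # Every policy any engine has bound to this table.
        union: Set[str] = set()
        for tables in bindings.values():
            union |= tables.get(table, set())
        for engine, tables in bindings.items():
            if table not in tables:
                continue
            missing = union - tables[table]
            if missing:
                gaps[(engine, table)] = missing
    return gaps


snapshot: EngineBindings = {
    "spark": {"customers": {"eu_row_filter", "mask_ssn"}},
    "trino": {"customers": set()},  # same table, no bindings
    "flink": {"orders": {"mask_card_number"}},
}

gaps = find_policy_gaps(snapshot)
for (engine, table), missing in gaps.items():
    print(f"{engine}.{table} is missing: {sorted(missing)}")
```

A report like this only detects divergence after the fact. It does not close the gap, which is why the rest of this article argues for a single catalog that every engine reads from.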

Open table formats like Iceberg and Delta Lake are designed for multi-engine interoperability: they let multiple engines read and write the same data. The format alone does not solve federated governance, though. Metadata about policy bindings, lineage, access controls, and audit trails lives one layer above the table format, in the catalog. When catalog continuity is missing, even the most interoperable table format cannot enforce consistent policy across engines.

What a Catalog That Travels With Data Actually Requires

Apache Gravitino is the open-source catalog architecture that solves the engine-boundary problem by design. The project provides a unified metadata layer that sits above individual engine catalogs and exposes a single API for metadata operations across Spark, Trino, Flink, and other compute engines. Schema definitions, lineage information, policy bindings, and quality metrics registered in Gravitino are visible to every engine that connects through the Gravitino API.

The Gravitino catalog model handles four requirements that engine-specific catalogs cannot. Metadata operations work the same across engines, so a table registered in Spark is queryable in Trino with the same schema and policy bindings. Lineage tracks across engine transitions, connecting datasets produced and consumed by different engines. Policy bindings persist as data moves, enforcing the same rules wherever the data is read. Schema evolution propagates uniformly across engines instead of diverging across engine-specific catalogs.
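The schema-evolution requirement is the easiest of the four to illustrate. The following is a conceptual sketch only: `UnifiedCatalog` and `EngineSession` are hypothetical classes written for this example and do not reproduce Gravitino's actual API. The point is the shape of the design, where engines ask the catalog for the current schema instead of keeping a private copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, type name)


@dataclass
class UnifiedCatalog:
    """Hypothetical single source of truth that keeps a schema history per table."""
    versions: Dict[str, List[List[Column]]] = field(default_factory=dict)

    def register(self, table: str, columns: List[Column]) -> None:
        self.versions[table] = [list(columns)]

    def add_column(self, table: str, column: Column) -> None:
        latest = self.versions[table][-1]
        self.versions[table].append(latest + [column])

    def schema(self, table: str) -> Tuple[int, List[Column]]:
        """Return (version number, columns) for the latest schema."""
        history = self.versions[table]
        return len(history), list(history[-1])


@dataclass
class EngineSession:
    """Hypothetical engine session that resolves schemas through the catalog."""
    name: str
    catalog: UnifiedCatalog

    def describe(self, table: str) -> Tuple[int, List[Column]]:
        return self.catalog.schema(table)


catalog = UnifiedCatalog()
catalog.register("orders", [("order_id", "bigint"), ("amount", "decimal(10,2)")])

spark = EngineSession("spark", catalog)
trino = EngineSession("trino", catalog)

# An engine-specific catalog is modeled here as a one-time copy of the schema.
private_copy = catalog.schema("orders")

# The schema evolves once, in the shared catalog.
catalog.add_column("orders", ("currency", "varchar"))

spark_view = spark.describe("orders")
trino_view = trino.describe("orders")
print(spark_view == trino_view, spark_view[0], private_copy[0])
```

Both sessions see version 2 with the new `currency` column, while the private copy is still at version 1. That stale copy is the divergence engine-specific catalogs produce over time.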

The Gravitino open catalog model contrasts with proprietary catalog implementations tightly coupled to a single managed platform. Proprietary catalogs work inside the platform that owns them and break at every engine boundary outside it. The lock-in pattern repeats every time enterprise teams choose a proprietary catalog: metadata becomes platform-dependent, policy enforcement applies only to the platform's engines, migration off the platform requires re-deriving the catalog from scratch, and other engines cannot read the catalog directly.

OpenMetadata is another open-source data catalog and governance option with similar design goals. OpenMetadata focuses more on metadata discovery, and Gravitino focuses more on multi-engine metadata operations, but both reflect a broader shift in the open-source ecosystem toward catalogs designed for engine-agnostic governance. In 2026, that shift is increasingly the baseline when teams evaluate data catalogs for enterprise data governance.

Acceldata xLake, the Kubernetes-native data platform in the x-Lake family, builds xGovern's federated catalog on Apache Gravitino. The catalog maintains metadata, schema, lineage, and policy continuity across every engine xLake supports: Spark, Trino, Airflow, and the GPU-accelerated runtimes that AI workloads use. Acceldata's data discovery capability exposes the Gravitino metadata model through a unified discovery interface for data teams.

The Role of the Catalog in AI Governance

The role of the data catalog in AI governance is the same role it plays in traditional governance, with sharper consequences. AI governance requirements look familiar at the surface: data lineage for model training, access control on training datasets, data residency compliance for regulated data used in models, and reproducibility tracking for audit and validation. Each requirement depends on catalog metadata that follows the data from source to model and back.

What breaks when the catalog does not follow the data into AI training is concrete and frequently invisible. Lineage gaps appear when training data moves from the data lake catalog into an ML platform with its own catalog. The model registry knows which dataset version trained which model, but the source-side catalog does not know which datasets were used for training. Untracked data access happens when an ML platform reads training data with platform-level credentials that bypass the lake-side policy bindings.

Compliance violations stay invisible until an external audit reconstructs the lineage manually. Reproducibility breaks when training environments cannot reconstruct the catalog state used for the original model run.

A unified multi-engine catalog provides the foundation for both operational data governance and AI governance simultaneously. Training datasets register in the same catalog as the source data, with the same policy bindings carried forward. Model lineage links into dataset lineage, so an audit question such as "which source datasets trained this model?" can be answered directly from catalog data.

Access controls applied to training datasets are enforced consistently whether the data is read by a Spark batch job, a Trino query, a model training pipeline, or a streaming inference workload.
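When model lineage and dataset lineage live in the same catalog, the audit question becomes a graph traversal. The sketch below is a minimal illustration under that assumption. The `LineageGraph` class, the entity names, and the engine labels are hypothetical and not taken from any specific catalog or model registry.

```python
from collections import defaultdict, deque
from typing import Dict, Set, Tuple


class LineageGraph:
    """Hypothetical lineage store: each edge records source, target, and engine."""

    def __init__(self) -> None:
        # target -> set of (source, engine that produced the target from it)
        self._upstream: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def record(self, source: str, target: str, engine: str) -> None:
        self._upstream[target].add((source, engine))

    def sources_of(self, entity: str) -> Set[str]:
        """Return every transitive upstream entity, across engine transitions."""
        seen: Set[str] = set()
        queue = deque([entity])
        while queue:
            node = queue.popleft()
            for parent, _engine in self._upstream.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


lineage = LineageGraph()
lineage.record("lake.customers_raw", "lake.customers_clean", engine="spark")
lineage.record("lake.customers_clean", "ml.churn_features_v3", engine="trino")
lineage.record("lake.orders", "ml.churn_features_v3", engine="spark")
lineage.record("ml.churn_features_v3", "model.churn_v7", engine="training")

# "Which datasets fed this model?" answered from one graph.
training_sources = lineage.sources_of("model.churn_v7")
print(sorted(training_sources))
```

If the training step were recorded only in an ML platform's own catalog, the last edge would be missing from this graph and the traversal from the model would return nothing, which is the lineage gap described earlier in this section.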

How Apache Ranger Completes the Governance Architecture

Data catalog governance has two distinct components that get conflated when teams talk about "governance." A complete data governance catalog architecture treats both as separate layers that work together. The catalog answers what data exists, where it lives, what its schema is, and how it relates to other datasets. Policy enforcement answers who can read this data, under what conditions, with what masking applied, and through which engine. Both need to share the same underlying metadata model for the architecture to work.

Apache Ranger is the open-source policy enforcement layer that operates on top of catalog metadata. The project provides fine-grained access control, row-level security, column masking, and audit logging across Spark, Trino, Hive, HBase, and Iceberg. A Ranger policy attached to a table or column in the catalog is enforced consistently by every engine that respects the policy, which is the second half of what "a catalog that travels with the data" actually delivers.

The catalog and policy enforcement layers work together by sharing metadata. The catalog says, "This column contains PII and lives in this table." The policy attached to that column defines access rules: who sees the raw values, who sees masked values, what conditions apply, and what gets logged for audit. When the catalog travels with the data and Ranger policies attach to catalog entities instead of engine-specific objects, the enforcement travels with the data, too.
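The division of labor described above can be sketched as follows. This is an illustrative model, not Apache Ranger's policy format or API: `ColumnEntity`, `MaskingPolicy`, `read_value`, the `pii` tag, and the role names are all assumptions made for the example. The masking rule (show only the last four characters) is likewise just one plausible choice.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ColumnEntity:
    """Catalog side: what the column is and how it is classified."""
    table: str
    name: str
    tags: Set[str] = field(default_factory=set)


@dataclass
class MaskingPolicy:
    """Policy side: attached to a catalog tag, not to an engine-specific object."""
    tag: str
    unmasked_roles: Set[str]

    def apply(self, value: str, role: str) -> str:
        if role in self.unmasked_roles:
            return value
        # Illustrative mask: keep only the last four characters.
        return "*" * max(len(value) - 4, 0) + value[-4:]


def read_value(
    column: ColumnEntity,
    value: str,
    role: str,
    engine: str,
    policies: List[MaskingPolicy],
    audit_log: List[Dict[str, object]],
) -> str:
    """Apply every policy whose tag matches the column, then record the access."""
    result = value
    for policy in policies:
        if policy.tag in column.tags:
            result = policy.apply(result, role)
    audit_log.append({
        "engine": engine,
        "column": f"{column.table}.{column.name}",
        "role": role,
        "masked": result != value,
    })
    return result


ssn = ColumnEntity(table="customers", name="ssn", tags={"pii"})
policies = [MaskingPolicy(tag="pii", unmasked_roles={"compliance_officer"})]
audit: List[Dict[str, object]] = []

via_spark = read_value(ssn, "123-45-6789", "analyst", "spark", policies, audit)
via_trino = read_value(ssn, "123-45-6789", "analyst", "trino", policies, audit)
raw = read_value(ssn, "123-45-6789", "compliance_officer", "trino", policies, audit)
print(via_spark, via_trino, raw)
```

Because the policy is keyed on the catalog tag, the `engine` argument affects only the audit record and never the result: the analyst sees the same masked value through either engine.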

Acceldata xLake's xGovern component provides unified governance through Apache Ranger and Apache Gravitino, connecting catalog metadata directly to policy enforcement across every engine xLake supports. The catalog layer tracks what data exists and where; the governance layer defines and enforces who can access it. Both operate on the same metadata model within a single component, which makes federated governance operationally feasible. The policy capability exposes this combined layer through a unified interface for governance teams.

Federated Governance Without a Traveling Catalog Is Just Federated Risk

Federated governance fails at engine boundaries when the catalog stays behind. The model works architecturally only when metadata, schema, lineage, and policy bindings persist as data moves between Spark, Trino, Flink, and the AI runtimes enterprises actually use in production. A complete federated governance architecture requires two things: an open, multi-engine catalog that maintains continuity across engines, and a policy enforcement layer that applies consistently wherever the data lives.

Acceldata xLake delivers both through xGovern. Built on Apache Gravitino and Apache Ranger, xGovern carries metadata across engine boundaries and enforces access policy wherever the data lands—eliminating the handoff between separate catalog and governance tools where engine-boundary risk typically hides.

See how xLake's federated governance architecture works in practice. Book a demo today!

Federated Data Governance and Catalog: Frequently Asked Questions

What is federated data governance and how does it differ from centralized governance?

Federated governance distributes data ownership to domain teams (finance owns sales data, marketing owns campaign data) while keeping policy enforcement centralized for consistency. Centralized governance puts both under one team, which becomes a bottleneck as engines, clouds, and domains multiply.

What is Apache Gravitino and what problem does it solve?

Apache Gravitino is an open-source, multi-engine metadata catalog. It solves catalog continuity: traditionally, each engine keeps its own catalog, so schema, lineage, and policy bindings vanish at engine boundaries. Gravitino sits above engine catalogs and exposes one metadata API.

Why do most data catalogs fail to support federated governance?

Most catalogs are coupled to a single platform or engine. They work well inside that platform but lose metadata and policy bindings the moment data crosses an engine boundary. Federated governance needs a catalog that is engine-agnostic by design, which is what Apache Gravitino provides.

What is the difference between a data catalog and data governance?

A catalog manages metadata: what data exists, where it lives, its schema, and relationships. Governance manages policy: who can access data, under what conditions, with what masking, and what gets audited. The catalog defines what exists; governance enforces who can use it.

How does xLake's approach to federated governance differ from proprietary catalog platforms?

xLake builds xGovern's federated catalog on Apache Gravitino, so metadata is portable across any engine supporting the Gravitino API. Proprietary catalogs invert this: metadata is bound to the platform, enforcement applies only inside it, and migration requires a full catalog rebuild.

About Author

Shivaram P R
