In late March 2026, a Hacker News post about Grafeo — a new embeddable graph database — sparked a discussion about the requirements modern data infrastructure must meet for AI systems.
Traditional relational databases alone don't fully cover many agentic requirements — especially for unstructured data, semantic search, and complex relationship models. Graph structures, vector databases, semantic layers: these are no longer hype technologies but relevant building blocks for many AI deployments.
DACH enterprises that build their AI strategy while their data pipelines remain stuck in the ETL paradigm often lack a solid foundation for scalable AI applications.
Where Classic ETL Reaches Its Limits with AI Agents
Five areas where the ETL paradigm often falls short of agentic AI requirements:
- Schema-first vs. schema-flexible — ETL requires a defined schema. AI agents need schema flexibility for unstructured documents, variable JSON, and email text. Rigid ETL pipelines often react poorly to unexpected field structures or schema changes (see the sketch after this list).
- Batch vs. real-time — ETL was designed for nightly batch runs. An AI agent responding in real time to a customer inquiry needs data from now, not last night.
- Pre-transformation vs. on-demand query — ETL pre-transforms all data according to predefined rules. AI agents ask ad-hoc questions no one foresaw — combinations that predefined transformation rules rarely anticipate.
- Centralized logic vs. agent-native data access — In ETL, transformation logic lives in the pipeline. In agentic architecture, logic lives in the agent — making direct, flexible data access more important than pre-built reports.
- Monolithic pipelines vs. modular-composable layers — ETL pipelines are often tightly coupled. Agentic AI architectures benefit from modular, independent data layers that can evolve without cascading dependencies.
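To make the schema point concrete: a minimal Python sketch (field names are hypothetical) contrasting a rigid, schema-first parser with a schema-tolerant one.

```python
import json

# Rigid, schema-first: any renamed or missing field breaks the pipeline.
def parse_strict(raw: str) -> dict:
    record = json.loads(raw)
    return {
        "customer_id": record["customer_id"],  # KeyError if the source renames this field
        "status": record["status"],
    }

# Schema-flexible: tolerate variable JSON and keep unknown fields for later use.
def parse_flexible(raw: str) -> dict:
    record = json.loads(raw)
    return {
        "customer_id": record.get("customer_id") or record.get("customerId"),
        "status": record.get("status", "unknown"),
        # Preserve everything else instead of silently dropping it.
        "extra": {k: v for k, v in record.items()
                  if k not in ("customer_id", "customerId", "status")},
    }
```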
ETL vs. ELT: The Quick Recap
For those who need the basics:
ETL (Extract, Transform, Load) is the classic method: data is extracted from source systems, then transformed (cleaned, merged, aggregated), and only then loaded into the target system. The transformation happens in a separate, specialised process, typically in an ETL tool or on a dedicated ETL server.
ELT (Extract, Load, Transform) is the modern method: data is extracted from source systems, loaded immediately in raw form into a powerful target system, and transformed there on demand, with the full computing power of the target (data warehouse, data lake, or lakehouse).
That sounds like a mere ordering question. In practice it is a fundamentally different architecture.
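A compressed sketch of the ordering difference, with hypothetical helper functions passed in: in ETL the transformation runs inside the pipeline before loading; in ELT the raw data lands first and the warehouse does the work.

```python
# ETL: transform in the pipeline, load only the finished result.
def run_etl(extract, transform, load):
    raw = extract()           # pull from source systems
    cleaned = transform(raw)  # cleaning/merging happens on the ETL server
    load(cleaned)             # the target only ever sees transformed data

# ELT: load raw data first, transform inside the target system.
def run_elt(extract, load_raw, warehouse_sql):
    raw = extract()
    load_raw(raw)             # raw data lands in the warehouse/lakehouse
    warehouse_sql("""
        -- transformation runs on the warehouse's own compute,
        -- typically as a versioned dbt model
        CREATE TABLE analytics.customers AS
        SELECT customer_id, status FROM raw.customers
    """)
```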
Real-Time vs. Batch: When Do I Need What?
Not every data source needs real-time ingestion. A pragmatic framework:
Real-time streaming (CDC, Kafka, event streams; a consumer sketch follows the lists):
- Customer status updates (purchase, cancellation, complaints)
- Transaction data in financial applications
- IoT/sensor data for operational AI applications
- Any data point where "yesterday's data" produces wrong agent decisions
Batch (hourly to daily):
- Historical analyses and reporting
- Data from legacy systems without CDC support
- Reference data (product catalogues, price lists) that changes infrequently
- Prepared ML feature tables for scoring
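For the real-time side, a minimal consumer sketch with the kafka-python client; the topic name, broker address, and event fields are placeholders.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to customer status events so an agent sees changes within seconds,
# not after the next nightly batch run.
consumer = KafkaConsumer(
    "customer-status-updates",           # hypothetical topic
    bootstrap_servers="localhost:9092",  # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("type") == "cancellation":
        # Hand off to the agent layer immediately: stale data here would mean
        # the agent reasons about a customer who has already left.
        print(f"cancellation from customer {event.get('customer_id')}")
```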
Data Quality in an ELT World
The biggest objection to ELT: "If raw data is loaded directly — who takes care of quality?"
Short answer: dbt tests and data contracts.
dbt tests: Every transformation model can be equipped with embedded tests: Is this field not null? Are values within a defined range? Are there no duplicates? These tests run after each transformation and flag quality issues immediately.
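The logic behind those tests is plain; here it is sketched in Python with pandas for illustration (dbt itself declares tests in YAML, and the table and columns below are hypothetical).

```python
import pandas as pd

def check_model(df: pd.DataFrame) -> list[str]:
    """Mirror three classic dbt-style checks on a transformed table."""
    failures = []
    if df["customer_id"].isnull().any():              # not-null check
        failures.append("customer_id contains NULLs")
    if df["customer_id"].duplicated().any():          # uniqueness check
        failures.append("customer_id contains duplicates")
    if not df["discount_pct"].between(0, 100).all():  # accepted-range check
        failures.append("discount_pct outside 0-100")
    return failures

issues = check_model(pd.read_parquet("stg_orders.parquet"))  # placeholder source
if issues:
    raise ValueError("; ".join(issues))  # fail the run, as a dbt test would
```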
Data contracts: Explicit agreements between data-producing teams ("we deliver this schema in this quality") and consumers. This shifts quality assurance to the source — where it belongs.
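A contract can also be enforced in code at the producer boundary. A minimal sketch with pydantic; the event shape is an illustrative assumption, not a standard.

```python
from pydantic import BaseModel, ValidationError

# The "agreement": the producing team promises this shape and these types.
class CustomerEvent(BaseModel):
    customer_id: str
    status: str        # e.g. "active", "churned"
    revenue_eur: float

def validate_outgoing(payload: dict) -> CustomerEvent | None:
    try:
        return CustomerEvent(**payload)  # reject bad data at the source
    except ValidationError as err:
        print(f"contract violation: {err}")
        return None
```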
Schema evolution: What happens when a source system renames a field? In ELT with versioned dbt models, schema evolution is manageable — breaking changes become visible in the git history, not in a hidden ETL configuration.
Tool Landscape for DACH
What enterprises in Austria and across the DACH region use in practice:
Transformation: dbt (open source or dbt Cloud, EU hosting available) is a widely adopted standard tool for modern SQL transformation stacks.
Ingestion: Airbyte (open source, self-hosted or cloud with data residency options) with a broad connector landscape. Fivetran as an alternative managed option.
Storage: Snowflake (EU Business Critical for GDPR-sensitive deployments), Databricks on Azure Frankfurt, BigQuery EU.
Orchestration and scheduling: Apache Airflow (self-hosted), Prefect, Dagster — for scheduling data pipelines and managing their dependencies (a minimal DAG sketch follows this overview).
MuleSoft in the enterprise context: MuleSoft Anypoint Platform remains a strong option for SAP, Salesforce, and legacy integrations. MuleSoft and ELT are complementary: MuleSoft orchestrates API-based integration and real-time events; dbt and Airbyte handle the analytical transformations.
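To show how the orchestration layer ties ingestion and transformation together: a minimal Airflow DAG sketch. The DAG id, schedule, and shell commands are placeholders; Airbyte and dbt also ship dedicated Airflow operators.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Nightly run: ingest raw data first, then build the dbt models on top of it.
with DAG(
    dag_id="elt_nightly",            # hypothetical DAG id
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="airbyte_sync",
        bash_command="airbyte-sync --connection customers",  # placeholder CLI call
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --select staging+",  # build staging models and downstream
    )
    ingest >> transform  # dbt runs only after ingestion succeeds
```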
A Practical Data Stack for AI Agents
A practical approach combines four layers:
- Data Lakehouse (Snowflake, Databricks) — A suitable foundation for many modern AI-adjacent data architectures. Both providers offer EU regions and residency options; specific compliance and governance requirements should still be verified case by case.
- ELT transformation (dbt + Airbyte) — Airbyte offers a broad connector landscape for raw data ingestion. dbt transforms data on demand with versioned SQL models and automated tests.
- Vector and semantic layer — Enables similarity search and makes business metrics accessible to AI agents without SQL knowledge (see the sketch after this list).
- Graph layer (optional) — Graph databases can be particularly useful when relationships, dependencies, and knowledge structures are central to the use case.
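What the vector layer does under the hood, in a minimal sketch: embed text, then rank by cosine similarity. The embed function here is a random stand-in for a real embedding model or service.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=384)
    return vec / np.linalg.norm(vec)

documents = ["cancellation policy", "price list 2026", "SLA for enterprise support"]
doc_vectors = np.stack([embed(d) for d in documents])

def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vectors @ q            # cosine similarity: vectors are unit-length
    top = np.argsort(scores)[::-1][:k]  # indices of the k best matches
    return [documents[i] for i in top]

print(search("how do I cancel my contract"))
```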
Migration Path: From Legacy ETL to Modern ELT
Phase 1: Parallel operation (3–6 months) — Run the new ELT stack for new use cases while the legacy ETL continues to serve existing reports.
Phase 2: Incremental migration (6–18 months) — Migrate ETL pipelines to dbt models one by one, starting with the least critical, and secure each with tests before decommissioning the legacy pipeline (a parity-check sketch follows the phases).
Phase 3: Build the agentic layer (in parallel) — The vector and semantic layers for AI agents can be built during the migration. The first agents can already access data from the new ELT stack before the legacy migration is complete.
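One way to secure each migrated pipeline with tests before decommissioning: run legacy and new outputs in parallel and compare them. A sketch with pandas; the connection string and table names are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("snowflake://...")  # placeholder connection string

# The same business entity, produced by the legacy ETL and by the new dbt model.
legacy = pd.read_sql("SELECT customer_id, revenue_eur FROM legacy.customers", engine)
modern = pd.read_sql("SELECT customer_id, revenue_eur FROM analytics.customers", engine)

# Cheap parity checks before the legacy pipeline is switched off.
assert len(legacy) == len(modern), "row counts diverge"
merged = legacy.merge(modern, on="customer_id", suffixes=("_old", "_new"))
diff = (merged["revenue_eur_old"] - merged["revenue_eur_new"]).abs()
assert (diff < 0.01).all(), f"{(diff >= 0.01).sum()} rows differ in revenue"
```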