AI-Native Transformation Framework

Data Engineer

You build the data and AI infrastructure the rest of the company runs on. Pipelines, warehouses, vector stores, model-serving infrastructure, observability — the foundation that lets agents do their work and analysts produce insight. The agent writes much of the code; you design the architecture and own the foundation.


Family: Engineering
Equivalent legacy role: Data Engineer, Senior Data Engineer, ML Engineer (with overlap), AI Engineer (emerging variant), Analytics Engineer
Reports to: Tech Lead, Engineering Manager, Director of Engineering, or Head of Data (in larger orgs)

The work

You own the data and AI infrastructure of the company. ETL pipelines, data warehouses, streaming systems, vector stores, model-serving infrastructure, agent context delivery, data observability. The agent writes much of the code; you design the architecture, validate that the infrastructure holds up under load, and own the foundation that everything else depends on.

This is engineering work that's distinct from application engineering — the material is different (data flows, not user interactions), the failure modes are different (silent data corruption, not visible bugs), and the consumers are different (analysts, agents, internal users — not external customers directly).

Day-to-day, you:

  • Design data architecture. Where data lives, how it flows, what models structure it, how it's accessed at scale. Architecture choices that compound for years.
  • Specify pipelines and infrastructure. The agent implements; you design what's being implemented and validate that it holds together.
  • Build the agent context layer. Agents doing production work need reliable, current, well-structured context. Designing the data substrate that supports this is a substantial part of the role at AI-native scale.
  • Own data quality and observability. Bad data corrupts every downstream consumer silently. Designing observability that catches issues before they propagate is core work.
  • Operate the AI infrastructure. Vector stores, embedding pipelines, model serving, evaluation infrastructure — these are first-class production systems with specific reliability and cost characteristics.
  • Validate at risk-graded gates. Routine pipeline changes and standard ETL flow through agent-only review. Schema changes, data deletion, cost-sensitive infrastructure decisions, privacy-relevant changes, and AI infrastructure modifications require your direct approval.
  • Partner with Data Analyst. The DA defines the questions and interprets results; you ensure the data foundation answers reliably. Definitions, instrumentation, attribution — these are co-owned.
  • Partner with application engineers. Their features generate data; that data flows through your infrastructure; your infrastructure feeds back into their features. The seam matters.
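
The risk-graded gate in the list above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: the `ChangeKind` categories mirror the examples in the text, while the names `Change`, `review_route`, and the two route labels are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class ChangeKind(Enum):
    ROUTINE_PIPELINE = "routine_pipeline"
    STANDARD_ETL = "standard_etl"
    SCHEMA_CHANGE = "schema_change"
    DATA_DELETION = "data_deletion"
    COST_SENSITIVE = "cost_sensitive"
    PRIVACY_RELEVANT = "privacy_relevant"
    AI_INFRASTRUCTURE = "ai_infrastructure"


# Kinds that, per the text, require the Data Engineer's direct approval.
HUMAN_APPROVAL_KINDS = frozenset({
    ChangeKind.SCHEMA_CHANGE,
    ChangeKind.DATA_DELETION,
    ChangeKind.COST_SENSITIVE,
    ChangeKind.PRIVACY_RELEVANT,
    ChangeKind.AI_INFRASTRUCTURE,
})


@dataclass(frozen=True)
class Change:
    description: str
    kinds: FrozenSet[ChangeKind]


def review_route(change: Change) -> str:
    """Route to human approval if any kind on the change is high-risk."""
    if change.kinds & HUMAN_APPROVAL_KINDS:
        return "human_approval"
    return "agent_review"
```

A change tagged with several kinds escalates if any one of them is high-risk, so a routine pipeline edit that also touches a schema still reaches you.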

What success looks like

Concrete outputs at this tier:

  • Pipeline reliability. Data pipelines run reliably, latency is bounded, failures are caught and recovered. Consumers don't get surprised by missing or corrupt data.
  • Data quality. Issues with data quality, instrumentation, or labeling get caught at the source, not at the downstream consumer.
  • Cost discipline. Compute cost, storage cost, and AI infrastructure cost are visible, current, and improving over time. You can defend infrastructure spend per outcome.
  • Agent infrastructure health. Agents have reliable access to fresh, well-structured context. Vector stores, embedding pipelines, and model-serving infrastructure run with predictable latency and cost.
  • Architecture coherence. Data architecture choices hold up over years. Schema migrations are manageable. The infrastructure ages well rather than collapsing under its own complexity.

What does not count as success: pipelines built, dashboards delivered, technologies adopted in isolation from outcomes.


What makes this work interesting

The interesting part is not the pipelines themselves. It is sitting at the foundation that the rest of the company increasingly depends on.

Your work is genuinely load-bearing. Analysts depend on you. Agents depend on you. Application engineers depend on you. Product depends on you. When data infrastructure is well-designed, the whole company moves faster; when it's not, everything else struggles to compensate.

You design at multiple time scales. Some decisions (schema, naming, partitioning) compound for years; some (specific pipeline tuning) need adjustment quarterly. Data engineers who can hold both time scales produce strong infrastructure.

AI-native operations need new data substrate. Vector stores, embedding pipelines, model serving infrastructure, agent context delivery — these are genuinely new systems being designed in real time. You're part of figuring out the patterns the industry will use.

The diagnostic work is satisfying. When data quality issues show up downstream, the investigation often involves the workflow, the spec, the instrumentation, and sometimes the infrastructure itself. The detective work is rich.

Cost engineering is interesting. Data infrastructure cost has very different characteristics from application infrastructure cost. Storage versus compute, batch versus streaming, fresh versus stale, vector dimensionality, embedding model choice — the optimization surface is novel and meaningful.

Cross-function partnership becomes deep. With routine ETL work absorbed, you have time for substantive engagement with Data Analysts, application engineers, Product, and the Workflow Architect. The role lives at productive intersections.

Your impact compounds. A well-designed data substrate saves the company years of work. A poorly-designed one creates technical debt that constrains decisions for years.

The career mobility is real. Strong Data Engineers at T3 are recruited heavily. The transferable skills (architecture, system design, cost engineering, AI infrastructure) are valuable across companies and industries.

What may not appeal. Your work is invisible when it works. Nobody notices a pipeline that runs smoothly; they only notice the one that breaks. Recognition is structural and quiet, not loud. Data Engineers who need direct user-facing impact sometimes find the work feels distant. You also work with consumers (analysts, agents, internal users) rather than customers directly, which can feel less concrete than application engineering. And data engineering has historically been undervalued relative to application engineering at many companies — though AI-native operations are changing this rapidly, the legacy under-recognition can still show up in some cultures.


Who thrives in this role

The aptitudes that matter most are systems thinking, architectural judgment, and cost discipline, all distinct from application engineering strengths.

You think in systems and flows. Data is system-shaped. Engineers who naturally see flows, dependencies, and emergent behavior produce strong data infrastructure.

You care about correctness over speed. Bad data corrupts everything downstream silently. Data Engineers who care about correctness produce trustworthy infrastructure; those who optimize for shipping speed produce hidden problems.

You're comfortable with long feedback loops. Architecture choices show their consequences over months and years. Data Engineers who need fast feedback struggle; those who can design with care and patience produce strong foundations.

You hold cost discipline. Data infrastructure can become expensive fast. Engineers who treat cost as a first-class concern produce sustainable infrastructure; those who don't produce surprise bills.

You can read across consumers. Analysts, agents, application engineers, internal users — they have different needs from the same data. Engineers who can hold multiple consumer perspectives produce useful infrastructure; those who only optimize for one fail others.

You write clearly. Data architecture documents, schema specifications, runbooks. Clear writing is core to the role.

You're suspicious of clean stories. When data looks too clean, you investigate. Healthy skepticism about your own pipelines is essential.

You partner well with adjacent specialists. Data Analyst, Application Engineer, DevOps Engineer, Workflow Architect. Engineers who can translate across these boundaries produce coherent infrastructure.

Less essential than before: mastery of any specific data tool or ETL framework, the ability to write complex SQL by hand, depth in any one specific data store. The agent absorbs these. The role values architecture, judgment, and design.


Skills to develop to get there

The aptitudes describe disposition. The skills below are what you actively build.

Data architecture specification. Writing how data flows, where it lives, and what models structure it, with enough rigor that the agent can build and the team can extend. How to practice: draft the data architecture for your current scope. Have a peer challenge it, then refine.

Pipeline reliability engineering. Designing pipelines that are observable, recoverable, and predictable under load. How to practice: for one pipeline, design the observability and recovery before building. Test by simulating failure.
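
One way to practice "test by simulating failure" is to wrap a pipeline step in a retry policy and drive it with a deliberately flaky stand-in. The sketch below assumes a simple exponential-backoff retry; `run_with_retry` and `FlakyStep` are hypothetical names, and a real pipeline would likely retry only on specific error types.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(step: Callable[[], T], max_attempts: int = 3,
                   base_delay_s: float = 0.0) -> T:
    """Run a step, retrying with exponential backoff until attempts run out."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:  # a sketch; narrow this in real use
            last_error = exc
            if attempt < max_attempts:
                time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


class FlakyStep:
    """Simulated step that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated upstream outage")
        return "loaded"
```

Calling `run_with_retry(FlakyStep(2), max_attempts=3)` recovers on the third attempt, while `FlakyStep(5)` exhausts the budget and surfaces a `RuntimeError` that your alerting should catch.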

Cost-per-outcome thinking. Reading data infrastructure cost as engineering data, not finance data. How to practice: monthly, audit cost drivers. Identify the top three opportunities for optimization without quality loss. Implement one.
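
A monthly cost audit of this kind can start from something as small as the sketch below. It assumes you can attribute a monthly cost and an outcome count (reports served, agent queries answered, or whatever your team treats as the outcome) to each pipeline; the `CostLine` structure and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CostLine:
    pipeline: str
    monthly_cost_usd: float
    outcomes: int  # assumed unit of value, e.g. queries answered


def cost_per_outcome(line: CostLine) -> float:
    """Cost divided by outcomes; spend with zero outcomes ranks worst."""
    if line.outcomes == 0:
        return float("inf")
    return line.monthly_cost_usd / line.outcomes


def top_opportunities(lines: List[CostLine], n: int = 3) -> List[CostLine]:
    """Return the n pipelines with the highest cost per outcome."""
    return sorted(lines, key=cost_per_outcome, reverse=True)[:n]
```

Ranking by cost per outcome rather than by raw spend is the design choice here: a cheap pipeline that serves nobody ranks above an expensive one that serves many.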

Schema and definition stewardship. Maintaining coherent data models that consumers can rely on. How to practice: take one core schema. Write the definitive specification. Get cross-function buy-in. The discipline compounds.
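
Part of stewardship is knowing when a proposed schema version would break consumers. A minimal sketch, assuming a schema can be represented as a mapping from column name to type name, and assuming that added columns are non-breaking while removed or retyped columns are breaking (your own compatibility rules may differ):

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name (assumed representation)


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List removed columns and type changes between two schema versions."""
    problems: List[str] = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != old_type:
            problems.append(
                f"type change: {column} {old_type} -> {new[column]}"
            )
    return problems
```

An empty result means existing consumers should keep working under these assumptions; a non-empty result is a signal to route the change through your direct approval.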

AI infrastructure design. Vector stores, embedding pipelines, model serving, agent context delivery. How to practice: for one AI workflow your company runs, document the infrastructure it depends on. Identify failure modes. Design controls.

Data observability specification. What gets measured, what triggers alerts, what's the expected response. How to practice: for each pipeline, write the observability spec before the pipeline. If the answer to "how would we know this is broken?" is "downstream complaints," the spec isn't done.
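
Writing the spec before the pipeline can be as concrete as declaring thresholds and a function that turns batch statistics into alerts. The sketch below assumes three checks (freshness, volume, and null rate in a key column); the names `ObservabilitySpec`, `BatchStats`, and `evaluate`, and the alert labels, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ObservabilitySpec:
    max_staleness: timedelta   # how old the latest load may be
    min_rows: int              # smallest acceptable batch
    max_null_fraction: float   # tolerated nulls in the key column


@dataclass(frozen=True)
class BatchStats:
    loaded_at: datetime
    row_count: int
    null_count: int  # nulls observed in the key column


def evaluate(spec: ObservabilitySpec, stats: BatchStats,
             now: datetime) -> List[str]:
    """Return the alerts this batch triggers; empty means healthy."""
    alerts: List[str] = []
    if now - stats.loaded_at > spec.max_staleness:
        alerts.append("stale")
    if stats.row_count < spec.min_rows:
        alerts.append("low_volume")
    # An empty batch is treated as fully null so it cannot pass silently.
    null_fraction = (
        stats.null_count / stats.row_count if stats.row_count else 1.0
    )
    if null_fraction > spec.max_null_fraction:
        alerts.append("null_spike")
    return alerts
```

Each alert label still needs an owner and an expected response written next to it; the code only answers "how would we know this is broken?".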

Cross-function communication. Writing for Data Analysts, application engineers, Product, DevOps, executives. How to practice: draft an infrastructure proposal. Show to one person from each function. Where they get confused is where the writing needs work.

Migration craft. Schema changes, infrastructure migrations, data deletions. The high-risk operations. How to practice: after each significant migration, write a one-paragraph reflection. The pattern across migrations is your training.

Pick the skill that maps to your most recent infrastructure disappointment. Practice it for a month.


How this differs from the legacy Data Engineer role

Legacy Data Engineer (pre-AI) | Data Engineer (AI-native)
Substantial time writing pipelines, ETL, and warehouse queries by hand | Pipeline code largely absorbed by agent; time goes to architecture and design
Data infrastructure is internal-only, for analysts and dashboards | Data infrastructure now serves agents, analysts, internal users, and external products
Cost discipline is occasional cleanup | Cost discipline is continuous; AI infrastructure introduces new cost characteristics
Schema decisions are mostly local | Schema decisions account for agent consumers, embedding model choices, vector indexing
Observability is afterthought | Observability is designed in; bad data caught at source, not at consumer
Best engineers are most operationally rigorous | Best engineers are sharpest architects, with judgment about cost and reliability
Career path: Data Engineer → Senior → Lead → Director of Data | Career path: same, plus lateral to Workflow Architect, Agent Supervisor (for AI infrastructure focus), Senior FSE with infrastructure depth

The role is not a faster Data Engineer. It is a more architectural one: pipeline implementation is absorbed by the agent, and design judgment expands.


Which role evolution patterns are in play

  • Elevation (primary). The role's center of gravity rises from implementation to architecture, cost design, and observability specification.
  • Convergence (secondary). Boundaries with DevOps (AI infrastructure), Application Engineer (data-aware features), and Data Analyst (definitions, instrumentation) blur as the Data Engineer gains time for substantive cross-function partnership.
  • Emergence (partial). AI infrastructure — vector stores, embedding pipelines, agent context delivery — is genuinely new responsibility within the Data Engineer scope.

Specialization within engineering still applies. The role remains distinct from Senior FSE because the material is different: data flows versus user interactions, silent failure modes versus visible bugs, infrastructure consumers versus customer-facing experiences. The Convergence pattern at T3 dissolves within-application specialty boundaries (FE/BE) more than it dissolves the application-vs-data-infrastructure boundary, which remains operationally distinct.


