A.01Mission dossier

Mission entry / 02

Lufthansa Data Intelligence Pipeline

Built and explored scalable data processing workflows using Python, PySpark, and Databricks during the Lufthansa Industry Solutions Data Intelligence Track. The work focused on ingestion, transformation, monitoring, validation, and analysis of large operational datasets within aviation-oriented engineering environments.

PySparkDatabricksETLAviationData engineeringMonitoring

Playback moduleHero media reserved

Hourly aggregated dashboard of real world flight data on a globe

ScopeLive frame

image

Presentation of live pipeline at lufthansa

B.01

Overview

Raw data is borderline useless. Millions of noisy data points alone are nowhere near enough to make intelligent inferences. Before meaningful decisions can be made, the data must be engineered, structured, cleaned, validated, and transformed into something humans and systems can actually reason about. The process of turning noisy operational data into clean, queryable intelligence (and ultimately dashboard-ready business insights) is what this project explores through a production-style Lufthansa data intelligence pipeline built around a Bronze → Silver → Gold medallion architecture.

Operational and reference data is pulled from the Lufthansa OpenAPI, landed as raw JSON, cleaned and expanded through PySpark, then promoted into optimized analytics tables for inspection, reporting, and engineering use.

C.01

System architecture

Data signal chainOperational data -> engineering intelligence

API query and ingestion

Python ingestion scripts query the Lufthansa OpenAPI for operational flight data and reference datasets, handling authentication, pagination, retries, and problematic records before saving the responses.

OpenAPIPython

Bronze layer

Raw API responses are landed as JSON with ingestion metadata, then streamed into Bronze Delta tables using Auto Loader / Spark Declarative Pipelines.

Raw JSONBronze

Silver layer

Nested payloads are exploded, flattened, type-cast, deduplicated, and cleaned into structured tables. Invalid or incomplete records are routed into quarantine tables instead of silently disappearing.

PySparkQuarantine

Gold layer

Clean operational flight data is joined with reference dimensions such as airlines, airports, aircraft, cities, and countries to create enriched fact tables ready for analysis.

GoldEnriched

Optimized intelligence output

The final tables expose reliable, queryable aviation intelligence with derived fields, operational status, hub-level analysis, and dashboard-ready aggregations.

AnalyticsDatabricks

D.01

Engineering challenges

LOG 01

The Reality of Industry Grade APIs

When I read through the documentation of different endpoints, there was a clear description of what to expect as a response. So I thought it would be as simple as adhearing to that. It was certainly NOT. Most of the airport records worked perfectly. Which somehow made the broken ones even worse.

A handful of records contained malformed or inconsistent structures that silently poisoned downstream transformations. There were airports with no names, nested fields behaved differently between airports, and certain records simply refused to conform to the schema the rest of the dataset followed. (Honorable mention to a selection of train and bus stations which were returned from the airport endpoint)

The dangerous part was that the pipeline didn’t always fail immediately. Sometimes the corruption only surfaced later during transformations or joins, which made the offending airport incredibly difficult to isolate.

The debugging strategy slowly evolved into a kind of “poison pill search.” Instead of repeatedly ingesting the full dataset and hoping for the best, I began narrowing batches, isolating subsets of airports, and progressively identifying which specific records were contaminating the pipeline.

It was one of the first times I encountered a very real data engineering lesson:

One bad row can break an otherwise perfectly valid system.

And unlike normal software bugs, the problem wasn’t code (well kind of but it wasn't my fault) — it was reality itself being inconsistent.

LOG 02

The “Data Lakes Love Mess” Realization

At the start of the project I kept trying to force the Lufthansa API into behaving like a clean database.

I wanted:

perfectly filtered endpoints, English-only responses, consistent structures, and tidy data before ingestion.

Instead I got multilingual arrays, inconsistent endpoint behaviour, nested metadata, and APIs that occasionally ignored parameters entirely.

At first this felt like a flaw in the system.

Then came the realization that completely changed how I thought about data engineering:

Data Lakes love mess.

The purpose of the Bronze layer was not to create perfect data. It was to preserve reality exactly as it arrived, regardless of how chaotic or inconsistent that reality happened to be.

Instead of endlessly fighting the API upstream, the architecture shifted toward:

ingest first,
preserve raw truth,
clean downstream,
structure later.

That mindset change was probably the single biggest conceptual leap of the project, and one our Lufthansa mentor was very happy to see us make.

LOG 03

Accidentally Learning Database Theory

Having what I would call some intermediate experience with reading data from APIs previously, I was decieved into thinking:

Surely we can just query the JSON directly.

Then I opened the Lufthansa airport data.

What looked like a straightforward API response turned out to be a deeply nested ecosystem of multilingual names, embedded coordinate structures, repeated metadata, and relationships hidden inside relationships.

Very quickly I fell into a database theory rabbit hole:

1NF,
2NF,
3NF,
foreign keys,
transitive dependencies,
star schemas,

At some point I found myself regretting having not studied harder in A Level Computer Science some years ago:

“Wait… what actually is 2NF again?”

A few hours later I was designing a fully normalized ER diagram connecting:

countries,
cities,
airports,
airlines,
aircraft,
and live flight status data.

The funny part is that none of this was originally the goal. I just wanted to ingest flight data. Instead I accidentally learned (and achieved goals which were designed to be scored far later in the course) why databases are structured the way they are in the first place.

LOG 04

The LHR → STR Decision

Early on I kept trying to design the entire system at once:

global airport ingestion, airline metadata, aircraft mappings, full relational architecture, multiple endpoints, complete normalization.

Very quickly the project started feeling impossibly large, especially given our rate limit of 1000 requests per hour.

The turning point was reducing the entire scope to one controllable route, which I later expanded to the main european hubs of Lufthansa:

LHR → STR.

That single decision changed everything.

Instead of solving “global aviation analytics,” the task became:

fetch one route, ingest one JSON, upload one dataset, prove one pipeline works.

Suddenly the architecture stopped being theoretical. IDs became meaningful. Joins became testable. The ER diagram stopped being documentation and started becoming an actual working blueprint.

It was a surprisingly important engineering lesson: large systems become understandable when reduced to one minimal, valid slice.

LOG 05

The “Where Does The Data Actually Live?” Crisis

At several points during the project I genuinely could not explain where the data physically existed.

Everything blurred together:

clusters,
catalogs,
schemas,
tables,
volumes,
DBFS,
compute,
storage,
SQL warehouses,
persistence.

The confusion became especially bad once clusters started terminating automatically while the data somehow continued to exist anyway.

There was a recurring existential feeling of:

Surely this file cannot simply be floating around in abstraction-space.

Modern cloud platforms abstract infrastructure so aggressively that beginners temporarily lose object permanence.

The breakthrough finally came when the hierarchy clicked:

Catalog = top-level isolation,
Schema = organizational workspace,
Volume = raw storage,
Table = structured analytical layer,
Compute = temporary processing engine.

Once that mental model settled into place, the entire Lakehouse architecture suddenly stopped feeling like buzzwords and started feeling like an actual engineered system. It was also around then that I was fed up of all these keywords and OUT OF MY OWN FREE WILL (something I was previously convinced was beyond any realm of possibility) made flashcards and good documentation to learn the theory

E.01

External Mission Assets

GitHub