Level-up with a Medallion Architecture

Posted on: August 12, 2023 (6 min read)

✍️ This post is adapted from a contribution I made to dataengineering.wiki. I highly recommend you check out their site & community! 💜

🏃‍♂️ Going for Gold

In the theme of my last article, we’re continuing with Databricks-inspired buzzwords. I like this one because it lets me make liberal use of running emojis. 🤷‍♂️

A medallion architecture is a data design pattern used to logically organize data in a lakehouse. The goal is to incrementally improve the quality of data as it flows through successive data quality “layers.”

This architecture consists of three distinct layers: bronze (raw), silver (validated), and gold (enriched). Each represents a progressively higher level of quality. Medallion architectures are sometimes referred to as “multi-hop” architectures.

Medallion Architecture

Medallion architectures came about as lakehouses and lakehouse storage formats (e.g. Delta, Iceberg, Hudi) became more widely adopted, around the time Spark and PySpark started to overtake older frameworks for distributed processing.

If you have a background in data warehousing, this probably sounds familiar! dbt currently recommends a

staging ➡️ intermediate ➡️ marts

process for cleaning data. The exact convention isn’t so important. I’ve also seen:

raw ➡️ stg ➡️ curated,
clean ➡️ views,

plus a bunch of other terms like protected and reporting, but the idea is the same: create simple data “layers” that represent stages in the transformation process and delineate data cleanliness.

🚉 Medallion Stages

🥉 Bronze

The bronze stage serves as the initial point for data ingestion and storage. Data is saved without processing or transformation. This might be saving logs from an application to a distributed file system or streaming events from Kafka.
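As a minimal sketch of that idea, the snippet below lands raw log lines in a bronze store without parsing them. The `land_in_bronze` helper, the `_source` and `_ingested_at` metadata fields, and the in-memory list standing in for a distributed file system are all assumptions made for illustration, not part of any particular platform.

```python
from datetime import datetime, timezone


def land_in_bronze(raw_lines, source, bronze_store):
    """Append raw payloads to the bronze store exactly as received.

    Only ingestion metadata is added; the payload itself is never parsed.
    """
    ingested_at = datetime.now(timezone.utc).isoformat()
    for line in raw_lines:
        bronze_store.append(
            {
                "payload": line,  # untouched raw string
                "_source": source,  # hypothetical metadata field
                "_ingested_at": ingested_at,  # hypothetical metadata field
            }
        )
    return bronze_store


# Hypothetical application log lines: one valid JSON, one malformed.
raw_logs = [
    '{"usr": {"id": 7}, "evt": "login", "createdAt": "2023-08-12T09:15:00"}',
    "not even json",
]
bronze = land_in_bronze(raw_logs, source="app-logs", bronze_store=[])
```

Note that the malformed line lands in bronze too. Deciding what to do with it is a job for a later layer.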

🥈 Silver

The silver layer is where tables are cleaned, filtered, or transformed into a more usable format. Note that the transformations here should be light modifications, not aggregations or enrichments. From our first example, those logs might be parsed slightly to extract useful information, like unnesting structs or eliminating abbreviations. Our events might be standardized to unify naming conventions, or a single stream might be split into multiple tables.
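Here is one way those light modifications might look in plain Python. Treat it as a sketch: the `ABBREVIATIONS` map, the `flatten` and `to_silver` helpers, and the sample payloads are hypothetical, and a real pipeline would more likely do this in Spark or SQL.

```python
import json

# Hypothetical mapping from abbreviated source keys to readable names.
ABBREVIATIONS = {"usr_id": "user_id", "evt": "event"}


def flatten(record, prefix=""):
    """Unnest nested dicts into a single level, joining keys with '_'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def to_silver(bronze_payloads):
    """Parse, unnest, and rename fields. No aggregation or enrichment."""
    silver, rejected = [], []
    for payload in bronze_payloads:
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            parsed = None
        if not isinstance(parsed, dict):
            rejected.append(payload)  # keep bad rows aside for inspection
            continue
        flat = flatten(parsed)
        silver.append({ABBREVIATIONS.get(k, k): v for k, v in flat.items()})
    return silver, rejected


payloads = ['{"usr": {"id": 7}, "evt": "login"}', "not even json"]
silver_rows, rejected_rows = to_silver(payloads)
```

The nested `usr.id` struct becomes a flat `user_id` column and `evt` becomes `event`, while the unparseable payload is set aside rather than silently dropped.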

🥇 Gold

Finally, in the gold stage, data is refined to meet specific business and analytics requirements. This might mean aggregating data to a certain grain, like daily/hourly, or enriching data with information from external sources. After the gold stage, data should be ready for consumption by downstream teams, like analytics, data science, or ML ops.
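To make "aggregating to a certain grain" concrete, the sketch below rolls silver events up to one row per day and event type. The `daily_event_counts` function, the column names, and the sample rows are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime


def daily_event_counts(silver_rows):
    """Aggregate silver events to a (day, event) grain."""
    counts = Counter()
    for row in silver_rows:
        day = datetime.fromisoformat(row["created_at"]).date().isoformat()
        counts[(day, row["event"])] += 1
    return [
        {"day": day, "event": event, "event_count": n}
        for (day, event), n in sorted(counts.items())
    ]


# Hypothetical silver rows: already parsed, flattened, and consistently named.
silver_rows = [
    {"user_id": 7, "event": "login", "created_at": "2023-08-12T09:15:00"},
    {"user_id": 8, "event": "login", "created_at": "2023-08-12T17:40:00"},
    {"user_id": 7, "event": "logout", "created_at": "2023-08-13T08:05:00"},
]
gold = daily_event_counts(silver_rows)
```

Because silver already guarantees clean column names and types, the gold step only has to express business logic.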

🧐 Why Layers?

The layer concept is not new: you might be familiar with it if you’ve worked in data warehousing.

It’s funny in a sense: though we’ve seen an explosion of new data tech, many concepts have remained the same. I remember a video that an (excellent) analytics manager showed me in 2018. If you’re expecting irrelevant, out-of-date concepts, brace yourself:

It’s a bit on the longer side, but if you shuffle through, you’ll find Tom O’Neill, former Co-Founder of Periscope Data (🪦 RIP), discussing fundamental data warehousing ideas, many of which are still extremely relevant.

Youtube snippet

Now, don’t get me wrong: I’m not saying medallion architecture = data warehousing or anything of the sort, really.

I’m simply emphasizing that medallion architecture is just a common pattern in data storage that’s been around for a while. Maybe even before dbt recommended it… It has different uses in a data lake, but for the analysts out there, you can think of this as a data engineer’s version of your “protective views” 😉

This one just has a fancy name. See, thereā€™s a method behind the madness.

Forest running

🤓 Why Medallion?

Medallion is similar to our data warehouse analogy, with some key differences. So, what does Medallion do well? What doesn’t it do so well? You’ve already read this far, so you might as well find out!

🔄 Upstream changes

With a layered architecture, we can eliminate most of the headaches from upstream schema changes. What usually happens when an upstream source changes?

Clever engineers and analysts have taken note! With multiple layers of storage, we have a single breakpoint to remedy changes. Column name changed from createdAt to created_at? No big deal! We’ll just rename createdAt ➡️ created_at in our silver layer: a simple and incredibly effective solution. Though this would be further evidence that any data using camel case is extremely sus. 🤨
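The createdAt rename can be sketched as a small alias map applied on the way into silver. The `COLUMN_ALIASES` dictionary and `normalize_columns` helper are hypothetical names. The point is only that the fix lives in one place and tolerates both spellings while old and new data coexist.

```python
# Hypothetical alias map: every known upstream spelling -> the silver name.
COLUMN_ALIASES = {"createdAt": "created_at"}


def normalize_columns(row, aliases=COLUMN_ALIASES):
    """Rename known upstream column variants to the silver-layer name."""
    return {aliases.get(column, column): value for column, value in row.items()}


# The same source before and after the upstream schema change.
old_schema_row = {"id": 1, "createdAt": "2023-08-01"}
new_schema_row = {"id": 2, "created_at": "2023-08-12"}

silver = [normalize_columns(r) for r in (old_schema_row, new_schema_row)]
```

Everything downstream of silver only ever sees `created_at`, so gold tables and dashboards do not need to change.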

Medallion architecture can be used to protect against schema changes from external sources, which your team has no control over, and internal sources, which ideally have data contracts, but likely don’t.

🚤 Lakehouse benefits

Additional benefits are obtained when using a lakehouse storage format, though it’s not a prerequisite. I’ve discussed Delta Lake and lakehouse tech before, but the gist is these formats record changes in a “transaction log” and thus have the ability to “time travel” for some retention period.

Separate layers + ACID guarantees + time travel make versioned and incrementally stored data a reality: a boon for disaster recovery, audits, and overall understanding of a data pipeline.

Just think about that. For the duration of your retention policy, you can replay your entire data lake/warehouse and rebuild all downstream sources. 🤯
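To build intuition for how a transaction log enables time travel, here is a deliberately tiny toy model. It is not how Delta, Iceberg, or Hudi are implemented. The `ToyVersionedTable` class and its methods are invented for this sketch, it logs rows where real formats track data files, and it ignores retention entirely.

```python
class ToyVersionedTable:
    """Append-only log of commits; each commit yields a new table version."""

    def __init__(self):
        self._log = []  # one entry per commit, never rewritten

    def commit(self, operation, rows):
        """Record a change ('append' or 'overwrite') and return its version."""
        if operation not in ("append", "overwrite"):
            raise ValueError(f"unsupported operation: {operation}")
        self._log.append((operation, list(rows)))
        return len(self._log) - 1

    def read(self, version=None):
        """Replay the log up to `version` (latest when omitted)."""
        if version is None:
            version = len(self._log) - 1
        if not 0 <= version < len(self._log):
            raise ValueError(f"unknown version: {version}")
        state = []
        for operation, rows in self._log[: version + 1]:
            state = list(rows) if operation == "overwrite" else state + rows
        return state


table = ToyVersionedTable()
v0 = table.commit("append", [{"id": 1}])
v1 = table.commit("append", [{"id": 2}])
v2 = table.commit("overwrite", [{"id": 99}])  # a bad deploy clobbers the table

latest = table.read()
recovered = table.read(version=v1)  # "time travel" to before the bad commit
```

Since nothing in the log is ever rewritten, the state before the bad overwrite is still recoverable, and any downstream layer can be rebuilt from it.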

Simplicity

The incremental layers of Medallion beget simplicity and organization. Rather than making every change at once, staged tables bring clarity to data processing and introduce conventions that your team can follow.

As an engineer, there’s a much lower cognitive load to making a change on a tiered processing system than on an ad-hoc one. Medallion systems will be easier to maintain and easier to onboard new users onto.

🙈 What Medallion Isn’t

We’ve talked about what Medallion is and who it’s for, but here are some downsides and problems Medallion doesn’t solve.

📽️ Recap

A medallion architecture is a data engineer’s version of warehouse storage layers. It pairs quite nicely with lakehouse formats, like a fine wine 🍷. It can help to keep your data tidy and organized. It even comes with some great disaster recovery benefits.

No, it’s not going to replace your data warehouse and no, you can’t forget everything you know about dimensional modelling. You can, however, have greater confidence in your data engineering pipelines and use more Forrest Gump gifs in your documentation.

I felt like it