Question: ELT vs ETL. Terminology-wise, one does the load before the transformation and one does it after, but this doesn't make much sense to me. Why is the distinction so important?
Answer: It's important for cost and future-proofing reasons. The reason we had ETL before was that storage, particularly cloud storage, was expensive, so we had to limit the data we were writing to our warehouse to keep costs down. There were plenty of benefits to this: the data models were usually well defined, and things were built properly by necessity. The first downside was that it was hard to add new fields or columns after the fact, since we had likely got rid of the original data. The second downside: people had to know what they wanted when adding a new data source; it had to be a well-defined request, not just "I want data from this API".
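A minimal Python sketch of the ETL shape described above (the record fields and function names are hypothetical, just for illustration): the transform runs *before* the load, so only the columns the data model defines ever reach storage, and anything dropped in the transform is gone for good.

```python
# Hypothetical ETL sketch: transform happens BEFORE load.

def extract():
    # Stand-in for pulling raw records from some API.
    return [
        {"id": 1, "name": "Ada", "debug_blob": "not in the data model"},
        {"id": 2, "name": "Linus", "debug_blob": "not in the data model"},
    ]

def transform(raw_records):
    # Only the agreed-upon columns survive. Adding "debug_blob" as a
    # column later would mean re-extracting from the source system,
    # because the raw data was never stored.
    return [{"id": r["id"], "name": r["name"]} for r in raw_records]

warehouse = []  # stand-in for the warehouse table

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
```

After the run, `warehouse` holds only the modeled columns; the raw payload never touched storage, which is the cost saving and the rigidity in one move.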
Enter cheaper storage in recent times, and suddenly we get ELT. We extract the raw data from the source and dump it into our data lake/data warehouse. Upside: we can chop and change schemas at will, since the source data always exists. Downside: less focus on data modeling, since everyone is just dumping data into the warehouse and it's a free-for-all.
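The same toy example reworked as ELT (again with hypothetical names): the raw payload is loaded untouched, and the transform becomes a downstream projection that can be rewritten at any time because the source data is still sitting in the landing table.

```python
import json

# Hypothetical ELT sketch: load happens BEFORE transform.

data_lake = []  # stand-in for a raw landing table

def extract_and_load():
    raw = [
        {"id": 1, "name": "Ada", "plan": "pro"},
        {"id": 2, "name": "Linus", "plan": "free"},
    ]
    # Dump the raw JSON as-is; no schema decisions are made yet.
    data_lake.extend(json.dumps(r) for r in raw)

def transform_downstream():
    # Because the raw data still exists, this projection can be
    # changed at will without re-extracting from the source.
    rows = []
    for s in data_lake:
        r = json.loads(s)
        rows.append({"id": r["id"], "plan": r["plan"]})
    return rows

extract_and_load()
modeled = transform_downstream()
```

Swapping `plan` for `name` in the projection is a one-line change here, whereas in the ETL version it would require going back to the source system.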
**Practical example of ELT downsides:** I've recently had to process 3.3 TB of data on Snowflake, all in one table in one variant column. When it was made two years ago, the original creator found it easier to just extract and dump the one stream into the table and create downstream tables later. This was all well and good initially, but two years later the data grew and query times grew with it. It cost me 8 hours to deconstruct the variant type into a better model that can better serve its downstream assets.
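The "deconstruct the variant" fix can be sketched in Python as well (the field names here are made up; the real work was done in Snowflake). The idea is to parse the opaque semi-structured column once and materialize typed columns, so downstream queries stop re-parsing JSON on every read.

```python
import json

# Hypothetical "one variant column" table: every row is an opaque
# JSON string, so every query must parse the whole payload.
variant_table = [
    json.dumps({"event": "click", "user_id": 1, "ts": "2024-01-01"}),
    json.dumps({"event": "view", "user_id": 2, "ts": "2024-01-02"}),
]

# Deconstruct the variant into a typed, columnar model once;
# downstream assets then read plain columns instead of JSON.
modeled_table = []
for row in variant_table:
    r = json.loads(row)
    modeled_table.append(
        {"event": r["event"], "user_id": r["user_id"], "ts": r["ts"]}
    )
```

In Snowflake itself this kind of flattening would typically be done in SQL against the VARIANT column, but the shape of the fix is the same: parse once, store modeled.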
Source: https://www.reddit.com/r/dataengineering/comments/ycqulb/elt_vs_etl/