Maybe I'm too stupid to understand the article... How does this achieve performant querying for olap and oltp purposes?
Based on my understanding, olap queries will go to the parquet files which are stored in a columnar fashion and oltp style queries will go to a caching layer that sits on top of those parquet files?
What's the special sauce here? Seems like they're just caching the data which, for all intents and purposes, seems like the same solution of storing another copy of the data which is what they say they're avoiding.
Hi, I work on Lakebase (but not on storage), here's how I understand it.
For Lakebase and Neon, our architecture needs the caching layer regardless (what we call Pageservers). Performing reads from S3 directly is too slow so we reconstruct pages and keep them on an nvme server for faster querying. Changing the format on S3 to be Parquet effectively introduces no additional copies over our existing architecture
But why? I’m skeptical of the idea of unifying storage just because it sounds “elegant” or “cool”. It’s not obvious to me how a single storage engine can compete with purpose-built OLTP and OLAP systems like Postgres and ClickHouse, without significant tradeoffs.
You also mention removing CDC pipelines. I’m curious if the materialization (conversion across formats) can catchup to an OLTP workload that is heavy (50K+ tps), which is pretty common these days. Also CDC if done right and with care can be magical for users and stays native to the OLTP/OLAP data-store.
Third, data Lakes and open formats are suitable for Data Warehousing / Data analyst use-cases than real-time customer facing apps. Sure, you might work on changing that, which is what you are upto, but you’ll always run into tradeoffs, which will make it hard to unleash the best performance, much needed for the latter category.
Conversion is async. The whole point is to never deal with CDC which is error prone and taxing Postgres with occupying a replication slot and burning memory and cpu in the OLTP system.
Super cool stuff. Being able to combine your analytical platform and transactional database into one storage layer without having to set up ETL pipelines in between is really a game changer. Especially since it's just postgres, instead of some proprietary database.
So then would LTAP sit to both the left and the right of the medallion architecture? Meaning would you on the left of Bronze use it as an OLTP and to the right of Gold use it as an OLAP? Currently we've been mainly utilizing it to the right of Gold to develop analytic PERN applications that allow us to reuse the RBAC/ACLs set in Unity Catalog, but from this article it seems like that's only half of its utility?
Part of the value of doing an ETL pipeline via streaming replication is you get the full history of data in a table. An SCD type 2 table where each row also has a valid_from and valid_to timestamp column.
How would someone do the same thing with this architecture?
It wouldn't be possible to do this with LTAP architecture since (I'm assuming) the individual logical changes are not visible. But honestly I've always seen SCD type 2 table as a workaround due to lack of data modeling experience in the source database. If you design your tables correctly, you shouldn't need SCD type 2 downstream.
For example, if you know your user can change emails, and there might be events from another source that is keyed by user email (e.g. marketing-related events), then naturally you will need some sort of email_history table that has historical mapping of user id to email (you probably need it for audit purposes too). Then in this case there is no need to build SCD type 2 table of user from CDC, it's already there.
Rather than answering directly, I'm thinking about this problem from the other end altogether ever since I saw the dbricks rt demo. Apologies for the rambling response, as I haven't yet finished thinking about this problem...
We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.
(1) Limited by computation - there's only so much data that we can store on disks and nvme
(2) Limited by wallet - disks and nvme are EXPENSIVE
Also, the tight coupling of compute and data didn't help. It limited the size of databases on the individual expensive compute nodes.
So, another question will be -
What's currently stopping me from keeping the scd history tables right in my oltp db? what's forcing me to copy state into my etl/elt pipeline and the process it into scd into a dedicated olap db?
To some extent,the answer is still the same - the oltp cannot scale for the storage size required for keeping historical data. So, I've had to take out the 'cold' historical data and keep it in my olap freezer.
Now, if oltp itself is scaling, I'm not gonna bother with the copying step. I'll just prefer to store the history in oltp itself.
In my perspective (majorly from handling IoT systems), I need olap for 2 reasons - (1) storage scalability, and (2) analytical processing speed
I now consider (1) to be a solved problem
As for (2), I'm still not sure how this architecture ends up matching the query processing speeds of column-oriented storages. But again, I need to study more.
The SCD pipeline still remains in some form. Either in the form of (1) scd rows that we currently keep (etl pipeline)
, or (2) as older lsn rows that simply don't get deleted (existing db engine).
I've done quite a lot of experimentation with (2), and it is a pretty solid concept to work with.
I've spent quite a lot of years hammering my brain at databases and datastores in general. And I've now got a feeling that this is it.
Finally.
Parquet files are smaller than row based storage in a database (but not those databases with focus on strong compression).
And for backup - the files are probably easier to just copy to multiple disks for redundancy, as opposed to database dumps and incremental backups which at the Petabyte scale will be a pain.
16 comments:
Maybe I'm too stupid to understand the article... How does this achieve performant querying for olap and oltp purposes?
Based on my understanding, olap queries will go to the parquet files which are stored in a columnar fashion and oltp style queries will go to a caching layer that sits on top of those parquet files?
What's the special sauce here? Seems like they're just caching the data which, for all intents and purposes, seems like the same solution of storing another copy of the data which is what they say they're avoiding.
Hi, I work on Lakebase (but not on storage), here's how I understand it.
For Lakebase and Neon, our architecture needs the caching layer regardless (what we call Pageservers). Performing reads from S3 directly is too slow so we reconstruct pages and keep them on an nvme server for faster querying. Changing the format on S3 to be Parquet effectively introduces no additional copies over our existing architecture
From what I have seen, it's basically a Lambda architecture.
How does LTAP architecture deals with major Postgres upgrade? Is it truly zero-downtime for both upstream and downstream?
But why? I’m skeptical of the idea of unifying storage just because it sounds “elegant” or “cool”. It’s not obvious to me how a single storage engine can compete with purpose-built OLTP and OLAP systems like Postgres and ClickHouse, without significant tradeoffs.
You also mention removing CDC pipelines. I’m curious if the materialization (conversion across formats) can catchup to an OLTP workload that is heavy (50K+ tps), which is pretty common these days. Also CDC if done right and with care can be magical for users and stays native to the OLTP/OLAP data-store.
Third, data Lakes and open formats are suitable for Data Warehousing / Data analyst use-cases than real-time customer facing apps. Sure, you might work on changing that, which is what you are upto, but you’ll always run into tradeoffs, which will make it hard to unleash the best performance, much needed for the latter category.
Conversion is async. The whole point is to never deal with CDC which is error prone and taxing Postgres with occupying a replication slot and burning memory and cpu in the OLTP system.
Super cool stuff. Being able to combine your analytical platform and transactional database into one storage layer without having to set up ETL pipelines in between is really a game changer. Especially since it's just postgres, instead of some proprietary database.
So then would LTAP sit to both the left and the right of the medallion architecture? Meaning would you on the left of Bronze use it as an OLTP and to the right of Gold use it as an OLAP? Currently we've been mainly utilizing it to the right of Gold to develop analytic PERN applications that allow us to reuse the RBAC/ACLs set in Unity Catalog, but from this article it seems like that's only half of its utility?
Here's what I don't understand:
Part of the value of doing an ETL pipeline via streaming replication is you get the full history of data in a table. An SCD type 2 table where each row also has a valid_from and valid_to timestamp column.
How would someone do the same thing with this architecture?
Both Iceberg and Delta Lake support 'time travel' so you can query data as it was at a certain date.
SELECT count * FROM my_table AS OF "2025-01-01"
https://delta.io/blog/2023-02-01-delta-lake-time-travel/
https://iceberg.apache.org/docs/latest/spark-queries/#spark-...
It wouldn't be possible to do this with LTAP architecture since (I'm assuming) the individual logical changes are not visible. But honestly I've always seen SCD type 2 table as a workaround due to lack of data modeling experience in the source database. If you design your tables correctly, you shouldn't need SCD type 2 downstream.
For example, if you know your user can change emails, and there might be events from another source that is keyed by user email (e.g. marketing-related events), then naturally you will need some sort of email_history table that has historical mapping of user id to email (you probably need it for audit purposes too). Then in this case there is no need to build SCD type 2 table of user from CDC, it's already there.
Rather than answering directly, I'm thinking about this problem from the other end altogether ever since I saw the dbricks rt demo. Apologies for the rambling response, as I haven't yet finished thinking about this problem...
We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.
(1) Limited by computation - there's only so much data that we can store on disks and nvme
(2) Limited by wallet - disks and nvme are EXPENSIVE
Also, the tight coupling of compute and data didn't help. It limited the size of databases on the individual expensive compute nodes.
So, another question will be -
What's currently stopping me from keeping the scd history tables right in my oltp db? what's forcing me to copy state into my etl/elt pipeline and the process it into scd into a dedicated olap db?
To some extent,the answer is still the same - the oltp cannot scale for the storage size required for keeping historical data. So, I've had to take out the 'cold' historical data and keep it in my olap freezer.
Now, if oltp itself is scaling, I'm not gonna bother with the copying step. I'll just prefer to store the history in oltp itself.
In my perspective (majorly from handling IoT systems), I need olap for 2 reasons - (1) storage scalability, and (2) analytical processing speed
I now consider (1) to be a solved problem
As for (2), I'm still not sure how this architecture ends up matching the query processing speeds of column-oriented storages. But again, I need to study more.
The SCD pipeline still remains in some form. Either in the form of (1) scd rows that we currently keep (etl pipeline) , or (2) as older lsn rows that simply don't get deleted (existing db engine).
I've done quite a lot of experimentation with (2), and it is a pretty solid concept to work with.
I've spent quite a lot of years hammering my brain at databases and datastores in general. And I've now got a feeling that this is it. Finally.
If safe keeper exposes the changes to the tables somehow, a type2 scd is just a windowed lag over the primary key sorted by the timestamp
I don't wanna see that S3 bandwidth bill after running some big query
There’s no S3 bandwidth bill for traffic to and from EC2 in the same region.
There are self hosted object stores which use the same protocol as S3. One example: https://github.com/minio/minio
Parquet files are smaller than row based storage in a database (but not those databases with focus on strong compression).
And for backup - the files are probably easier to just copy to multiple disks for redundancy, as opposed to database dumps and incremental backups which at the Petabyte scale will be a pain.