I’ve always wondered why OLTP databases didn’t go the route of tiered storage systems: keep data in memory, cache it on NVMe, persist it permanently to object storage, with different guarantees at each tier.
Oracle's "flash cache" was that, but that was mainly intended to mitigate performance of spinning hard disks. Not sure if that is still a thing though.
If I'm not mistaken, Oracle's Exadata puts "intelligence" into the storage nodes so they can evaluate WHERE conditions independently, which means they take on the role of a compute node as well, not only storage. I don't know whether they can evaluate other operations there as well (e.g. aggregations or joins).
Google's (Postgres-compatible) AlloyDB Omni also has similar functionality now - the main DB work (persistence, etc.) still has to land on persistent block storage, but additional data can be cached for reading on local NVMe disks.
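Conceptually that read side is just a read-through cache in front of the durable tier. A toy sketch of the shape (my own simplification, not AlloyDB internals; the cache path and block layout are made up):

    /* Toy read-through cache: try a fast local "NVMe" cache file first,
     * fall back to the durable block-storage tier on a miss and warm the
     * cache.  Writes/durability would still go to the durable tier. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    #define BLOCK_SIZE 8192

    /* Stand-in for the slow, durable block-storage tier (hypothetical). */
    static void read_from_block_storage(long blkno, char *buf)
    {
        memset(buf, 0, BLOCK_SIZE);
        snprintf(buf, BLOCK_SIZE, "block %ld from durable storage", blkno);
    }

    static void read_block(long blkno, char *buf)
    {
        char path[128];
        snprintf(path, sizeof(path), "/tmp/nvme_cache/%ld.blk", blkno);

        FILE *f = fopen(path, "rb");
        if (f && fread(buf, 1, BLOCK_SIZE, f) == BLOCK_SIZE) {
            fclose(f);                       /* cache hit: served locally */
            return;
        }
        if (f)
            fclose(f);

        read_from_block_storage(blkno, buf); /* miss: go to the durable tier */

        f = fopen(path, "wb");               /* best effort: warm the cache */
        if (f) {
            fwrite(buf, 1, BLOCK_SIZE, f);
            fclose(f);
        }
    }

    int main(void)
    {
        char buf[BLOCK_SIZE];
        mkdir("/tmp/nvme_cache", 0700);
        read_block(42, buf);                 /* first read misses ... */
        read_block(42, buf);                 /* ... second one hits the cache */
        printf("%s\n", buf);
        return 0;
    }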
Oracle's Exadata is a whole other beast (I co-authored a book about it back in 2010, and it has improved even further since then). It's a hybrid, shared-storage distributed system - not consensus-based replication (although they support Raft for global data systems now too), but a distributed, coherent buffer cache (global cache) based database system. Because it's shared storage, you can write copies of blocks and WAL to multiple separate storage cells (NVMe or even just remote RAM) via direct RDMA operations, without the OS kernel or system calls getting involved.
For analytic queries, yep, Oracle can push down filtering, column projection, many aggregations, and join filters (bloom filters) into the storage cells for early filtering. The bloom filters are used to pre-filter the next table in the join, based on the output of the preceding query plan nodes.
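For anyone who hasn't run into join bloom filters: the general technique (nothing Oracle-specific here, just a toy) is to hash the join keys already seen on the build side into a bit array, ship that to wherever the probe-side scan runs, and drop rows that can't possibly match before sending them back. Something like:

    /* Toy join bloom filter.  False positives are possible (those rows get
     * filtered later by the real join), false negatives are not, so it is
     * safe to drop non-matching rows at the scan. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOOM_BITS (1u << 16)

    typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

    static uint32_t h1(uint64_t k) { return (uint32_t)(k * 0x9E3779B97F4A7C15ull >> 40) % BLOOM_BITS; }
    static uint32_t h2(uint64_t k) { return (uint32_t)(k * 0xC2B2AE3D27D4EB4Full >> 40) % BLOOM_BITS; }

    static void bloom_add(bloom_t *b, uint64_t key)
    {
        b->bits[h1(key) / 8] |= 1u << (h1(key) % 8);
        b->bits[h2(key) / 8] |= 1u << (h2(key) % 8);
    }

    static int bloom_maybe_contains(const bloom_t *b, uint64_t key)
    {
        return (b->bits[h1(key) / 8] & (1u << (h1(key) % 8))) &&
               (b->bits[h2(key) / 8] & (1u << (h2(key) % 8)));
    }

    int main(void)
    {
        bloom_t bf;
        memset(&bf, 0, sizeof(bf));

        /* Build side: join keys produced by the earlier plan nodes. */
        uint64_t dim_keys[] = {3, 17, 42};
        for (size_t i = 0; i < sizeof(dim_keys) / sizeof(dim_keys[0]); i++)
            bloom_add(&bf, dim_keys[i]);

        /* Probe-side scan, as a storage cell would run it: only rows whose
         * join key might match get shipped back to the database node. */
        uint64_t fact_keys[] = {1, 2, 3, 17, 99, 42, 1000};
        for (size_t i = 0; i < sizeof(fact_keys) / sizeof(fact_keys[0]); i++)
            if (bloom_maybe_contains(&bf, fact_keys[i]))
                printf("ship row with key %llu\n",
                       (unsigned long long) fact_keys[i]);
        return 0;
    }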
Probably because of the "D" in ACID transactions: the transaction log cannot be meaningfully write-cached.
OTOH, writes to tables/indexes are already done "out of order" and aggressively cached in the buffer pool, and only flushed to permanent storage occasionally (relatively rarely, in fact - e.g. SQL Server does it approximately once a minute).
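Right - the two write paths have very different durability requirements. A generic sketch of the pattern (not any specific engine's code):

    /* Commit must make the log record durable before returning success
     * (the "D" in ACID), so the fsync on the WAL can't be deferred.
     * Dirty data pages can sit in the buffer pool and be written back by
     * an occasional checkpoint, because the WAL can replay them after a
     * crash. */
    #include <fcntl.h>
    #include <unistd.h>

    static int wal_fd;    /* append-only transaction log */
    static int data_fd;   /* table/index file */

    static void commit(const char *wal_record, size_t len)
    {
        write(wal_fd, wal_record, len);
        fsync(wal_fd);                 /* per commit: cannot be write-cached */
    }

    static void checkpoint(const char *page, size_t len, off_t offset)
    {
        pwrite(data_fd, page, len, offset);
        fsync(data_fd);                /* done occasionally, not per transaction */
    }

    int main(void)
    {
        wal_fd  = open("wal.log",   O_WRONLY | O_CREAT | O_APPEND, 0644);
        data_fd = open("table.dat", O_WRONLY | O_CREAT, 0644);

        commit("INSERT ...\n", 11);    /* durable the moment commit() returns */
        checkpoint("page image", 10, 0);

        close(wal_fd);
        close(data_fd);
        return 0;
    }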
This is what SQL Server Hyperscale does. I’d assume Aurora does something similar too.
See:
https://github.com/neondatabase/neon
Even if they wanted to try something like that, in many cases it'd probably require a fair bit of code restructuring, so ideas like this don't get tried willy-nilly.
PostgreSQL is great in that they've put serious engineering effort into things like the SQL standard, reliability, etc., but one thing that's frankly quite horrid in 2025 is that their reliance on a fork model for processing has left them with many _important global variables_ that need a fair bit of refactoring to take out (the fork model does give some stability perks since the code is written in C, so it's not an entirely horrible choice).
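For anyone wondering how the fork model and the globals are related: fork() gives every backend its own private copy of all global variables, so "global" effectively means "per session" for free, and nothing ever forced that state to be passed around explicitly. Moving to threads would make all of it suddenly shared. Toy illustration (not PostgreSQL code):

    /* Each forked child gets a copy-on-write copy of the globals, so it can
     * scribble on them without affecting the postmaster or other backends.
     * Threads would share one copy and need locks or thread-local storage. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static int current_xact_id = 0;          /* stand-in for backend-global state */

    static void handle_connection(int conn_no)
    {
        current_xact_id = conn_no * 100;     /* mutate the "global" freely */
        printf("backend %d sees xact id %d\n", conn_no, current_xact_id);
    }

    int main(void)
    {
        for (int i = 1; i <= 2; i++) {
            pid_t pid = fork();
            if (pid == 0) {                  /* child = one backend */
                handle_connection(i);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;                                /* parent: reap the backends */
        printf("postmaster still sees xact id %d\n", current_xact_id);
        return 0;
    }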
(2023), but this still seems to be the only real documentation on the internet.
For reference, the (very minimal!) official docs: https://www.postgresql.org/docs/current/tableam.html
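For a sense of what the extension side looks like, here's the rough skeleton from memory (treat the details as approximate - the real TableAmRoutine has a few dozen callbacks you'd have to fill in, see src/include/access/tableam.h):

    /* Skeleton of a custom table access method; the callback list is elided,
     * which is where all the actual work (scans, inserts, vacuum, ...) goes. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "access/tableam.h"

    PG_MODULE_MAGIC;

    static const TableAmRoutine my_tableam_methods = {
        .type = T_TableAmRoutine,
        /* .scan_begin, .scan_getnextslot, .tuple_insert, ... omitted here */
    };

    PG_FUNCTION_INFO_V1(my_tableam_handler);

    Datum
    my_tableam_handler(PG_FUNCTION_ARGS)
    {
        PG_RETURN_POINTER(&my_tableam_methods);
    }

    /* SQL side, roughly:
     *   CREATE FUNCTION my_tableam_handler(internal) RETURNS table_am_handler
     *       AS 'MODULE_PATHNAME' LANGUAGE C;
     *   CREATE ACCESS METHOD myam TYPE TABLE HANDLER my_tableam_handler;
     *   CREATE TABLE t (id int) USING myam;
     */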