Towards Multiverse Databases

blog.acolyer.org

90 points by telotortium 6 years ago

SQL isn't hierarchical enough. Everyone who has tried to interpret a join result has experienced this: the DB is trying to pack a tree into a table and something is lost in translation.

It affects performance too -- SQL wants storage locality by table, but applications often want locality by user.

App-level permission / roles is something that everyone runs into with a DB-backed app of a certain size, and it has always felt to me like better hierarchy design could help -- not just foreign keys but 'settings table is under users table', you have to query it via users.

benesch 6 years ago
> It affects performance too -- SQL wants storage locality by table, but applications often want locality by user.
This problem gets substantially worse in geographically distributed databases, like Spanner or CockroachDB, where joining tables whose data is stored in different localities can incur a significant latency penalty (in the hundreds of milliseconds).
They both offer an interesting solution to the problem called interleaved tables. CockroachDB docs on the subject are here [0], and Spanner's docs are here [1].
Interleaved tables don't fix the fact that SQL will hand you back a flat table when you have hierarchical data, but they do fix the spatial locality problem. Here's an example, from the CockroachDB docs, of how users and their order data can be interleaved, exactly the way it sounds like you've wanted:
```
    /customers/1
    /customers/1/orders/1000
    /customers/1/orders/1002
    /customers/2
    /customers/2/orders/1001
    /customers/2/orders/1003
```
[0]: https://www.cockroachlabs.com/docs/stable/interleave-in-pare...
[1]: https://cloud.google.com/spanner/docs/schema-and-data-model#...
oarabbus_ 6 years ago

>SQL isn't hierarchical enough. Everyone who has tried to interpret a join result has experienced this: the DB is trying to pack a tree into a table and something is lost in translation.
Can you expand on this? I'm not sure I follow.
The only dataset that I have worked with where SQL limitations hit hard and fast was a 3D fMRI timeseries dataset, the type of data which isn't meant to be queried using SQL in the first place. Sure there's also certain graph-based datasets which you wouldn't want to use SQL but this is by far the exception and not the rule. You'd be surprised how many datasets can easily be ETL'ed and stored relationally, and even nested tree structures are generally nothing a few SQL views can't fix.
As for your "affects performance" claim - this has been made efficient for DBs like Redshift with compound/interleaved SORTKEYs and all/key DISTKEYs. And other modern warehouses like BigQuery, Snowflake, etc have their own solutions.
I think SQL is just hierarchical enough, not unlike the porridge in the Goldilocks fairy tale.
skybrian 6 years ago

FYI, some SQL databases let you physically interleave rows from different tables, which would let you put all the data for a user in the same place. For example:
https://cloud.google.com/spanner/docs/schema-and-data-model
https://www.cockroachlabs.com/docs/stable/interleave-in-pare...
acjohnson55 6 years ago

I agree that there are times when I want to natively represent hierarchy, instead of having to think relationally. But you can get around this by storing hierarchical data in your tables in most RDBMSs, like JSONB in Postgres.
- awinter-py 6 years ago
  
  I love JSONB but I don't expect it to perform well for unbounded things
  suddenly I have to store all items in a single DB row and read the whole thing to access any of them, giving me a new source of IO inefficiency
  what if the target table is keyed by date and I'm selecting a small subset? JSONB will still force the DB to interact with the entire JSONB blob.
  
  acjohnson55 6 years ago
  
  In Postgres, JSONB can be indexed in some pretty useful ways.
  But generally speaking, at some point, sometimes you have to do the less ergonomic thing in the name of performance. I'm just of a mind to postpone that until it's worth the tradeoff. Often, you can architect in such a way that such a change won't cascade through the rest of your system.
  
  combatentropy 6 years ago
  
  I use PostgreSQL and JSONB a lot, but JSONB is not the type of any column in any of my tables, only in views and functions.
sebazzz 6 years ago

I don't entirely agree. What I want is hierarchical views/indices for certain queries. How the object graph looks differs with each query.
- awinter-py 6 years ago
  
  yes, agree, it's query-dependent, but ORMs have been unpacking joins into sublists for years
  SQL should offer some native way to do this

striking 6 years ago

The lack of this feature is why, for myself, PostgreSQL Row Level Security has been a non-starter.

After all, without a feature like this, you can't query on behalf on multiple users at a time without also encoding your RLS rules in some app queries, and caching becomes more difficult to implement as it now cannot be implemented independently of the app.

Is there any way to retrofit this behavior onto Postgres?

ZitchDog 6 years ago

I have done it, you can store data in a transaction using SET LOCAL and refer to it in RLS policies.
I wouldn't recommend it though, it ends up being a performance nightmare.
- eloff 6 years ago
  
  I'm curious why this ends up being a bad idea. Can you elaborate please?
  
  ZitchDog 6 years ago
  
  Many operators are not leakproof and as such will not push down into the RLS query. We had to manually mark the operators as leakproof even though they were not.
  Operators cannot be made leakproof in RDS, so we had to choose between RDS and RLS.
  Explain plans for the simplest queries get crazy. So when your perf is bad, it's hard to know why (so RLS gets blamed for more than its share, but it also does cause problems)
- LunaSea 6 years ago
  
  Is this due to policy conditions not being indexed or due to another reason?
  I'd be interested if you had more information.
  
  saltcured 6 years ago
  
  We gave up on RLS when we realized that the query engine behaved as if the RLS-filtered table was being fully materialized prior to any join or where clause filtering. So nearly every query turns into full table scan and sort spasms.
  
  taffer 6 years ago
  
  Which version of Postgres did you use? There have been significant improvements in RLS enforcement in Postgres 10: https://www.postgresql.org/docs/10/release-10.html#id-1.11.6...
  
  LunaSea 6 years ago
  
  Was this slow even after the policy conditions were indexed?
  Did you have only equality type RLS conditions or also more heavy ones too?
  
  saltcured 6 years ago
  
  Our typical policies would have to compare a column value to the local session variable where we stored client authentication context attributes. Sometimes this was an embedded column and sometimes we had to use a scalar subquery to cross-walk over to the stored data that acted as a row or row-class ACL.
  Our service now renders its own joins and subqueries, merging the client-based constraints with other application filtering criteria. We are never using prepared statements but instead dynamically render SQL where the client context is baked in as constants in the policy-enforcing filters. The query planner seems to do a much better job deciding on join and filtering orders, particularly when the application filtering criteria sparsify the result so that policies only need to be enforced on a small subset of the tables.
  For RLS to match our current system, I think it would need a way to partially evaluate the policy, so it could reuse common subexpressions and treat them like constants during plan optimization.
  
  ZitchDog 6 years ago
  
  (see my comment above)
atwebb 6 years ago

I am not sure of how RLS works within Postgres, within SQL server, the function is applied at query time if applicable. You can bounce it up against a permissioning table to limit specific users/groups. Would groups/roles not be the way to manage sets of users?

viksit 6 years ago

How exactly is this different from what Oracle has offered as part of Database Vault since the mid 2000s?

[1] https://www.oracle.com/database/technologies/security/db-vau...

ms705 6 years ago

I'm one of the authors of the paper.
Thanks for pointing this out -- while I knew about various advanced Oracle features, I'd missed the Database Vault feature.
The multiverse database idea and Oracle's Database Vault are different, but somewhat complementary:
1) Database Vault is designed to protect sensitive application data against privileged users (DBAs, highly privileged role accounts). This is an important problem: administrator accounts are juicy targets, and restricting their access while still allowing maintenance activities is hard. Multiverse databases, as described in the paper, do not solve this problem, as the administrator still has access to the base universe. But ideas from Database Vault and its "realms" concept could be combined with the multiverse DB concept to achieve this!
2) Unlike multiverse databases, Database Vault does not protect the data against a buggy application, nor does it isolate different end-users within a single application from each other. In practice, this is where a lot of leaks do damage: a bug in the frontend either exposes information or can be exploited to expose it. Database Vault won't help here, as the leak does not even require a privileged user to be involved, and since the application runs inside a single realm.
You can see multiverse databases as the DB vault idea with application-level end-users each having their own realm. Making realms/universes work efficiently at scale, though, poses some serious systems research challenges. It also raises questions about how to write the policies defining what's visible to each end-user -- something that Database Vault doesn't have to deal with, because it merely requires an application-specific policy that indicates what information is sensitive and needs protecting.
jerf 6 years ago

Footnote 5 references "Kristy Browder and Mary Ann Davidson.The virtualprivate database in Oracle9iR2. Oracle Technical WhitePaper, Oracle Corporation. 2002." and in the paper can be found under section 2, "Existing Approaches" with the text "Transparent query rewriting approaches, by contrast, avoid predefined views. Instead, they dynamically insert restrictions congruent with access policies into queries on execution [5, 8, 17]. Both view-based and query rewriting approaches increase the final query’s complexity, slowing it down (e.g., by 3–10×in Qapla [17])."
Does that provide any clues? I honestly don't know since I don't know what that Oracle feature is under the hood.
Also, I'd point out that my initial reaction to your post was a bit hostile, but I noticed this is a paper so "how does it differ from this $PREVIOUS_WORK" is a fair question. I thought at first it was a project someone did, where that question is perhaps still fair, but a bit, ah, strong to lead with, shall we say.

wcdolphin 6 years ago

I have come to believe that privacy and permissions logic tends towards the complexity of your product. Unless you’d consider having much of your application logic be in the data store via procedures or something like VoltDB’s jar based user defined functions, I’d be wary of putting permissions logic in the DB.

dustingetz 6 years ago

Facebook killed raw social graph search a long time ago due to creepy queries and privacy concerns; having throttled access via UI to information is a lot different than having uncontrolled semantic access (especially once data starts getting cached outside their control. For example where they can no longer comply with removal requests.

emilecantin 6 years ago

I'm currently working on an offline-first app with PouchDB as the data store on the front-end, and I'm going to use exactly this concept on the server. Each user will be an actual CouchDB user, with their own databases.

My use-case lends itself well to this paradigm, though. Other apps might not be able to to this as well as mine.

AgentME 6 years ago

The article describes a way to use one database for all users, and mentions cases like "each user can see all posts, but can only see the author field on posts they made". The difference between that and a database-per-user design is that in a database-per-user design, there's no shared data (or if there is, it's manually synced in to all of the databases by something outside of the database).
- emilecantin 6 years ago
  
  Yeah, that's exactly my use-case; there's very little shared data.
  I guess we'll have to wait for other use-cases!

devnonymous 6 years ago

Here's another idea to build on top of this. How about isolating that layer which enforces the access control in a manner so that it acts a bridge between multiple fronteds and multiple databases? Now wouldn't that be awesome? Just a thought.