The author is right to note that Haskell can optimise across module (abstraction) boundaries. However, I remember from my childhood that Debray [1] did a lot of work on link-time optimisations (late 1980s). And of course there's the classic Self work that fed into the JVM [2], and the whole-of-program compilers that are receiving renewed attention now; MLton [3] is a classic of the genre, with supercompilation being an active area of research AIUI. So at least sometimes abstraction boundaries are transparent to optimisers.
On the other hand the classic data abstraction story (signatures/interfaces for structures/modules) naturally allows for selecting or optimising implementations depending on uses. There was some great work done in the early 2000s on that (see [4]) and I'm sure the state-of-the-art has moved on since [5].
[1] https://dblp.org/pid/d/SKDebray.html
[2] https://en.wikipedia.org/wiki/Self_(programming_language)
[3] http://mlton.org/
[4] https://dblp.org/pid/59/4501.html
[5] https://en.wikipedia.org/wiki/Interprocedural_optimization
hell yea
They are consistency boundaries too.
> This problem is usually caused by a leaky abstraction; the ORM, or whatever database abstraction you are using, can’t anticipate that it would need to send N queries, so it can’t automatically optimize this down to a single query.
If you do an if-statement with an ORM before doing an update-statement, the result of the if-statement is already stale before the update occurs. If you skip the ORM and put your if-statement in a where-clause, it's less of a problem.
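To make the point concrete, here is a minimal sketch in Django-ORM style (the Account model and its balance field are hypothetical names); the same idea applies to any ORM:

    from django.db.models import F
    from myapp.models import Account  # hypothetical model with a `balance` field

    def withdraw_racy(account_id, amount):
        # Read-check-write through the ORM: the check runs in application code,
        # so it can be stale by the time the UPDATE is executed.
        account = Account.objects.get(pk=account_id)
        if account.balance >= amount:
            account.balance -= amount
            account.save()

    def withdraw_atomic(account_id, amount):
        # The check lives in the WHERE clause, so the database evaluates it
        # against the current row inside a single UPDATE statement.
        rows = (Account.objects
                .filter(pk=account_id, balance__gte=amount)
                .update(balance=F("balance") - amount))
        return rows == 1  # 0 rows updated means the balance check failed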
> However, this only works since Haskell is declarative / pure; the low-level operational semantics (like evaluation order) are abstracted away from the programmer, and as such are amenable to optimization.
What else is declarative / pure? SQL. Not ORMS.
> What else is declarative / pure? SQL.
Thank you. People so often forget (or don’t realize) that their RDBMS is also doing a ton of optimization on their query, and a primary reason it’s able to do so in real-time is because SQL is declarative.
In C#, there is Linq to SQL. Linq to SQL is an ORM. It does SQL code generation, even from user-provided code as long as it is in Linq Expressions form.
With DelegateDecompiler, you can turn native lambdas into Linq Expressions. (You just need to re-wrap the decompiled method body to remove the extra "this" parameter from the lambda). With this, you can write C# code, and it will generate SQL code.
You just defined ORMs and made no comment on consistency. And Linq is what prompted me to write the comment.
In this EF code, we're deducting an amount from some balance, not necessarily the highest balance:
Whereas in SQL, it will work.
Is the goal to make good ORM queries easier or to prevent bad queries? It's not clear there's really a compiler solution to the latter. If you're inside a loop in which a database cursor is in scope, then further database queries are prohibited? It's hard to see how that could be enforced other than something like What Color Is Your Function (https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...) with some functions marked as making queries and others as not.
To solve this, maybe instead best practice would be to ensure the database connection is not in a global variable and must be passed down. That would make it more obvious when a database is improperly used within a loop.
The same problem exists for any expensive operation within a loop (say, a database query while parsing the results of an API call, or vice versa).
To answer the OP, no. Your compiler will never do the optimization you want it to do, no matter how far up you try to move the abstraction.
The fundamental issue isn't just that your compiler doesn't understand SQL. The problem is that your compiler doesn't understand how data is or will be stored. It's blind to the current state of the dataset.
For example, maybe it's the case that the data is stored in a hashtable and it's rarely written. In that case, N+1 might actually be the right way to query the data to avoid excessive read locks across the database.
Perhaps the data has one, two, or several indexes created against it. Those indexes can change at any time and have to be created deliberately, since building them can take a lot of resources.
An RDBMS builds up a huge amount of statistics for optimization purposes. That's all information the compiler would have to JIT into the application to get similar performance, or even to have a good feel for how to do the optimization.
> However, what if we raise the abstraction boundary and make the ORM part of the language? This means that we could formulate rewrite rules for the ORM, allowing it to eg merge the N queries into a single query.
It's correct that abstraction boundaries are optimization boundaries, but I don't think you need to make queries part of the language itself to raise the boundary.
To give a concrete example, take the Django ORM in Python. If you write a function which returns a single database record, then calling that function many times in a loop is naturally going to result in an n+1 query. However, if you instead return a QuerySet, then what you're returning is a lazily-evaluated sequence of records. Then, the caller can make the choice on whether to immediately evaluate the query (when they only need one record) or collect together a bunch of QuerySets and union them into a single query.
In other words we give the caller more control and thus more opportunity to optimize for their use case.
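A rough sketch of that pattern, with hypothetical Book and author_id names:

    from myapp.models import Book  # hypothetical model

    def books_by_author(author_id):
        # Lazy QuerySet: building it sends no SQL.
        return Book.objects.filter(author_id=author_id)

    def titles_n_plus_one(author_ids):
        titles = []
        for author_id in author_ids:
            # Evaluating inside the loop fires one query per author: the N+1 problem.
            titles.extend(b.title for b in books_by_author(author_id))
        return titles

    def titles_single_query(author_ids):
        # The caller can instead combine the work into one query, e.g. with __in
        # (Django also supports qs1 | qs2 and QuerySet.union()).
        return [b.title for b in Book.objects.filter(author_id__in=author_ids)]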
Abstraction boundaries can be optimization opportunities if you choose the right abstractions. You want to present interfaces that go well with the underlying capabilities. In the case of ORMs, the underlying capabilities include various kinds of set manipulation, so you should present an interface that supports filtering, unions, lazy evaluation, etc.
The key is capabilities rather than implementation. If your data structure is good at iteration and bad at random access, present an abstraction that supports enumeration but not indexing. But don’t present an abstraction that hands out Nodes and lets the caller mess around with their Next pointers.
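A small illustrative sketch of that idea, with all names hypothetical: expose the capability the structure is good at (enumeration) and never hand out the nodes themselves.

    from dataclasses import dataclass
    from typing import Any, Iterator, Optional

    @dataclass
    class _Node:                      # internal representation, never exposed
        value: Any
        next: "Optional[_Node]" = None

    class ResultStream:
        """Good at iteration, bad at random access, so expose only enumeration."""

        def __init__(self, head: Optional[_Node]):
            self._head = head         # callers never see nodes or their next pointers

        def __iter__(self) -> Iterator[Any]:
            node = self._head
            while node is not None:
                yield node.value
                node = node.next

        # Deliberately no __getitem__: random access would be O(n) per lookup.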
Name-dropped both the N+1 problem and Haskell without mentioning Haxl.
Take a look at last decade's solution to this problem:
https://github.com/facebook/Haxl/blob/main/example/sql/readm...
Just avoid ORM. It’s designed to encapsulate, not to be efficient. Lazy loading is part of the initial design. Turning it off destroys encapsulation and requires you to know what the code below will fetch.
In that case you might just abandon ORM and preload the context with something more efficient.
IMHO the ORM was an unfortunate choice for trying to develop a generalization about abstractions.
SQL's declarative model has an impedance mismatch with imperative programming; ORMs attempt to deal with that mismatch.
SQL hides complexity, but Codd's goal when developing the relational model was to allow non-programmers to access data.
The design decisions and tradeoff analysis were massively different from those of an abstraction that merely targets developers.
There are many different persistence models that have vastly different tradeoffs, costs and benefits.
Balancing integration and disintegration drivers is complex and can impact optimization in many ways.
> ORMs attempt to deal with that mismatch.
Technically, ORMs attempt to deal with the mismatch between relations[1] and the rich data structures (objects) that general-purpose programming languages normally allow you to express. Hence the literal name: Object-Relational Mapping. That SQL, the language, is declarative is immaterial.
> but Codd's goals when developing the relational model was to allow non programmers to access data.
That is unlikely. He was staunchly opposed to database vendors adding "veneer" to make use more friendly and considered SQL an abomination. Codd was dead set on mathematical purity, which is, I'd argue, also why his vision ultimately failed in practice as actual relational databases are too hard to understand for those not well versed in the math.
[1] Technically tables, since we're talking about SQL, which isn't relational. But the mapping idea is the same either way.
The 'relation' in the relational model is the tables, specific named columns and tuples, with specific abstracts [0]
The relation is the table; normalization, foreign keys, candidate keys, etc. are all extensions to that base model for Codd.
Some of the impedance mismatch is due to that, and not just the declarative nature or extensions.
Specifically, I think the Alice book covers how a candidate key plus the remaining attributes form the row, but only the candidate key has an identity; the rest of the row is a substring.
Some quotes from the link, but searching for 'user' will hit what I think justifies my interpretation.
> Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).
> To sum up, it is proposed that most users should interact with a relational model of the data consisting of a collection of time-varying relationships (rather than relations). Each user need not know more about any relationship than its name together with the names of its domains (role qualified whenever necessary): Even this information might be offered in menu style by the system (subject to security and privacy constraints) upon request by the user.
[0] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
> The 'relation' in the relational model is the tables
No. A table is a list/multiset of tuples, while a relation is a set of tuples. If you squint hard enough they might look similar, but they are not the same. The relational model has no concept of tables.
> Some of the impedance mismatch is due to that
ORM doesn't really have an impedance mismatch in and of itself. It is just a data transformation.
The impedance mismatch that is oft spoken of in association with ORM revolves around the N+1 problem. This is where you get some bizarreness that has to lean on hacks to overcome the real-world constraints that the mathematical model doesn't account for. If you are using something like SQLite, that isn't so much of a problem in practice, of course.
In first-order predicate logic with an n-ary relation, attributes are columns and tuples are rows, a.k.a. a table, which is what Codd used to justify it.
The typical normalization, let's say in a star schema, is viewable as a least fixed point (FO + LFP = P).
It is similar to what Codd called adjacency lists, which luckily died in the 80s because they conflicted with real adjacency lists. Recursive CTEs add transitive closure, which when added to FO gets you NL (deterministic transitive closure gives you L).
Still, it doesn't matter: the relational part of the relational model is attributes, or column names... it is equivalent.
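For reference, the standard descriptive-complexity results being invoked here, all on ordered finite structures, in a quick LaTeX summary:

    \begin{align*}
    \mathrm{FO} + \mathrm{LFP} &= \mathrm{P}  && \text{(Immerman--Vardi)} \\
    \mathrm{FO} + \mathrm{TC}  &= \mathrm{NL} && \text{(transitive closure, what recursive CTEs add)} \\
    \mathrm{FO} + \mathrm{DTC} &= \mathrm{L}  && \text{(deterministic transitive closure)}
    \end{align*}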
> the relational part of the relational model is attributes or column names
The relational part is defined, most importantly, by the set. That was key to Codd's model, and the source of his primary criticism of SQL.
But his ideas went out of fashion long ago. Postgres, pre-1995, was the last time we saw a relational database anyone heard of. Sure, there have been some esoteric attempts more recently to revive the concept, but they never went anywhere. For all intents and purposes the relational model is dead. We live in a tablational world now.
> it is equivalent.
It is not, though. In fact, I posit that lists/multisets are easier to reason about, at least where one doesn't have a strong mathematical understanding (i.e. the layman), and that is why SQL "won". A list, if carefully constrained, can represent a set — which is maybe what you are struggling to suggest — but that does not make it a set.
Can you provide any specific situation where the following does not hold?
Especially if I add the constraint that rows have to be unique?
Relations=Tables
Rows=Tuples
Columns=Attributes
I think we may be from different schools of set theory; I come from the more abstract direction, where set membership, inclusion, and equality are separate concepts and don't need to be decided on to define a relation.
Not that my view is better just different.
I just happened to come through the {{foo},{foo}}=={{foo}} school.
> Especially if I add the constraint that rows have to be unique?; Relations=Tables
This is kind of like saying that a dynamically-typed programming language is statically-typed because you can add tests to enforce the same constraints as what static typing enables.
There is a lot of truth in that statement, but you completely fail to capture the differences in calling said language statically-typed. And, sure, for a silly comment on HN, who cares, right? But on the other hand, what are you gaining by obscuring the differences? It is not like said programming language is somehow better because you call it statically-typed. It is still the same language no matter what you call it.
SQL does not follow the relational model, as it was proposed by Codd. And that's okay. Arguably it is better because of it. Why not call it what it is?
> Not that my view is better just different.
Of course you can view things however your little heart desires. You can even see what I call a chicken as being a relational database if that is what floats your boat. But we are not talking about your view, we are talking about Codd's view. Every single comment that preceded this was clearly about Codd's take, to the point that you even linked to some of his work for us to immerse ourselves in his thoughts. And we know very well how Codd felt about tables. He wrote at length about everything he thought was wrong with SQL.
"Object-relational mapper" is a misnomer, they should be called network data model to SQL mappers. They're not primarily for mapping between different arrangements of the same data, but about recreating a network data model perspective. How many ORMs support composite primary keys and non-binary relationships?
As you said, the challenge of understanding math is why the relational vision failed, but also why the industry keeps reviving the network data model.
> How many ORMs support composite primary keys and non-binary relationships?
Theoretically all. However, often ORM and query building are combined into a unified framework that links them together into a higher order abstraction. You may know this type of thing by names like the active record pattern or the data mapper pattern, with concrete implementations like ActiveRecord and Entity Framework. Such frameworks often make it difficult to use the ORM component in isolation (and even where that isn't the case, doing so largely defeats the purpose of using the framework anyway), and as such you become constrained by the limits of the query builder. I suspect that is what you are really thinking of.
EF is limited to binary relationships: https://learn.microsoft.com/en-us/ef/core/modeling/relations.... When entities are mapped to rows and relationships to foreign key constraints, it results in a network data model interpretation of SQL databases. The relational model is better conceptualized from a fact-oriented modeling perspective, in which entities map to values, domains to FK constraints, relationships to the association of two or more entity sets in a table, and rows represent propositions about entities rather than entities themselves. However, non-binary relationships don't easily translate into navigational data access (cf. Bachman's paper "The programmer as navigator" about the network data model), and a logical view is difficult for the non-mathematical programmer, while structs and pointers are naïvely easy to understand; that's why ORMs remain popular. The impedance mismatch is less about mapping flat to rich data structures than about the network data model vs fact-oriented perspectives and row-at-a-time vs batch data access.
> EF is limited to binary relationships
But due to the constraints of its querying abilities, not the ORM aspect. ORM is only concerned with the data once it is materialized. How you materialize that data, be it the product of a network relationship, or a fact-oriented model, is immaterial. Remember, EF isn't just an ORM toolkit. It is an ORM toolkit plus a number of other features. But since we're only talking about ORM those other features don't relate to the discussion.
> while structs and pointers are naïvely easy to understand, that's why ORMs remain popular.
And, well, because it is an all-round good idea to turn externally-sourced data models into internal data models anyway. It makes life a hell of a lot easier as things start to change over time. Even if you were to receive serialized data that is shaped exactly in the structure you want it at a given moment in time, you would still want to go through a mapping process to provide a clean break between someone else's data model and your own. This isn't limited to SQL database tables. You'd do the same if you received, say, JSON from a web service.
ORMs with lazy loading are a bad example, because they are a performance optimization relative to ORMs that do eager loading. Eager loading requires you to load all associations, which is even more expensive.
The fallacy with lazy loading ORMs is that they often give you no, or very poorly suited, methods for specifying what does and doesn't need to be loaded. Even when they do, the problem is that you cannot separate the data loading aspects from the querying aspects. You end up with data repository service classes that have a "findBooksByYearWithAuthorAndSequels" method, rather than just "findBooksByYear" plus a separate way of specifying "WithAuthorAndSequels" that doesn't require you to come up with a ridiculous naming scheme.
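In Django-ORM terms (Book, author, sequels, and published_year are hypothetical names), the separation being asked for looks roughly like this: the repository method only does the querying, and the caller attaches the loading spec.

    from myapp.models import Book  # hypothetical model

    def find_books_by_year(year):
        # Pure query logic; returns a lazy QuerySet and loads no associations.
        return Book.objects.filter(published_year=year)

    # The caller decides, per use case, what to load eagerly, instead of needing
    # a findBooksByYearWithAuthorAndSequels method for every combination.
    books = (find_books_by_year(1999)
             .select_related("author")       # JOIN for the to-one relation
             .prefetch_related("sequels"))   # extra query for the to-many relation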
this nerdsniped me so hard ty
Then the next stop: DSL