The parser.py [1] has only 1.6k lines, and it is a hand-written parser. This size is amazing if it's really capable, but I intuitively doubt it. For example, DuckDB's select.y [2] has 3,700 lines, and that is only for SELECT. ZetaSQL's grammar file [3] is almost 10k lines.
SQL is a monstrous language. Is there any trick that keeps the code simple?
[1] https://github.com/tobymao/sqlglot/blob/main/sqlglot/parser.... [2] https://github.com/duckdb/duckdb/blob/master/third_party/lib... [3] https://github.com/google/zetasql/blob/master/zetasql/parser...
You should try it out and see. Here are the test cases:
https://github.com/tobymao/sqlglot/blob/main/tests/fixtures/...
I can even parse and optimize TPC-H https://github.com/tobymao/sqlglot/blob/main/tests/fixtures/....
SQLGlot's parser handles a superset of SQL, so it's much more flexible than any one engine.
If you look at DuckDB's select.py, you'll notice very "generous" use of newlines, e.g. putting just a ";" on a line, or just "{", or just "}" etc. Partially explains the difference I suppose.
Nitpick: The file is called select.y, not select.py
Was confused seeing these mentions of a select.py and why there would be any noticeable amount of semicolons in a Python file.
Good catch! Fixed.
Maybe they're just good at computers
The code looks good
The parser.py seems extremely terse, without a line of docstrings or typing. Perhaps that explains it?
No, SQL is big and ugly. I wonder if it can do nested queries and EXISTS? Also, I can't see SOME/ANY/ALL.
SQLGlot supports nested queries and EXISTS. SOME, ANY, and ALL aren't fully supported yet, but it's only a couple of lines to add support since I already support EXISTS.
https://github.com/tobymao/sqlglot/blob/main/tests/fixtures/... https://github.com/tobymao/sqlglot/blob/main/tests/fixtures/...
I'm impressed.
I've now added support for ANY, ALL, and SOME. SOME is identical to ANY, so I always default to ANY.
You can see what it took to add this in this commit.
https://github.com/tobymao/sqlglot/commit/ab49a3a2964bdff3a3...
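For a rough sense of what the new comparisons look like once parsed, here's a minimal round trip (a sketch; the query and table names are made up, and the exact formatting of the regenerated SQL may differ by version):

    import sqlglot

    # Parse a query that compares against a subquery with ANY, then re-emit it.
    sql = "SELECT name FROM tracks WHERE album_id = ANY (SELECT id FROM albums)"
    expression = sqlglot.parse_one(sql)
    print(expression.sql())  # the ANY comparison should round-trip back to SQL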
Author here, feel free to ask me any questions!
Something that I'm working on is a pure Python SQL engine https://github.com/tobymao/sqlglot/blob/main/sqlglot/executo.... It does the whole shebang: parsing, optimizations, logical planning, physical execution.
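A minimal sketch of what running a query through that engine looks like (the tables and columns below are made up; the exact API surface may differ by version):

    from sqlglot.executor import execute

    # Tiny in-memory tables (hypothetical data), executed entirely in Python.
    tables = {
        "sushi": [{"id": 1, "price": 1.0}, {"id": 2, "price": 2.0}],
        "order_items": [{"sushi_id": 1, "order_id": 1}, {"sushi_id": 2, "order_id": 1}],
    }

    result = execute(
        "SELECT o.order_id, SUM(s.price) AS total "
        "FROM order_items o JOIN sushi s ON o.sushi_id = s.id "
        "GROUP BY o.order_id",
        tables=tables,
    )
    print(result.rows)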
Holy smokes, this is super impressive!
I have a personal question if you don't mind -- I do some SQL query generation and transpilation for both work and hobby. One headache I've run into recently is generating nested EXISTS() subqueries.
Imagine you have something like this:
This is all fine, but what if you want to say "A binary operation on related entities", which you want to generate something like:

    WHERE EXISTS(SELECT 1 FROM Albums WHERE Albums.ForeignKey = t.PrimaryKey
      AND EXISTS(SELECT 1 FROM Tracks WHERE Tracks.ForeignKey = Albums.PrimaryKey AND AlbumId = 40))

How to do this is giving me a headache for a lot of reasons, and I can't seem to come up with a good way. Any tips, references, or search terms to google?

Thank you, look forward to digging in more + gave your repo a star!
i rarely ever use exists and prefer to do left joins or left semi joins (in spark)
i'm not exactly sure what you're asking though, in terms of sql generation, it's not difficult for me because i just take in sql and output sql from the ast
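For illustration, that take-SQL-in, emit-SQL-out round trip with sqlglot could look roughly like this (a sketch: the outer SELECT ... FROM Artists t wrapper is made up so the fragment from the question parses as a full statement):

    import sqlglot
    from sqlglot import exp

    # (The outer SELECT/FROM wrapper is hypothetical; the question only gave the WHERE clause.)
    sql = """
    SELECT t.PrimaryKey FROM Artists t
    WHERE EXISTS(SELECT 1 FROM Albums WHERE Albums.ForeignKey = t.PrimaryKey
      AND EXISTS(SELECT 1 FROM Tracks
                 WHERE Tracks.ForeignKey = Albums.PrimaryKey AND AlbumId = 40))
    """

    # Parse to an AST, inspect or rewrite it, then emit SQL again.
    tree = sqlglot.parse_one(sql)
    print(len(list(tree.find_all(exp.Exists))))  # both nested EXISTS subqueries
    print(tree.sql(pretty=True))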
Exists can be very efficient as it allows execution to stop immediately when something is found.
Any plans on supporting Snowflake? May I submit a PR?
I'd love to support more engines. PRs are very welcome!
Wow this looks neat. I've been looking for an easy to use sql parser (I've tried sqlparse).
How permissive is the parser? Do you need to know the exact dialect up front or can it make some intelligent guesses for unknown UDFs etc?
The parser is very permissive. You don't need the exact dialect upfront; it handles unknown UDFs internally as anonymous functions.
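A quick sketch of that behaviour (the UDF name below is made up):

    import sqlglot
    from sqlglot import exp

    # An engine-specific UDF that sqlglot has no built-in definition for.
    tree = sqlglot.parse_one("SELECT my_custom_udf(a, 3) FROM t")

    # Unknown functions are kept as anonymous function nodes instead of
    # raising a parse error, so the query still round-trips to SQL.
    print(tree.find(exp.Anonymous))
    print(tree.sql())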
Thanks, I'll definitely have a look.
Optimisation is a tricky one, as you need the various cardinalities; how do you handle that?
I don't do join order optimizations because those rely on cardinality estimation; I leave that up to the physical plan. But there's plenty more to optimize, like predicate and projection pushdown.
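For example, the optimizer module can apply those rule-based rewrites against a known schema (a sketch; the table and schema below are made up):

    import sqlglot
    from sqlglot.optimizer import optimize

    # Hypothetical schema so the optimizer can qualify and prune columns.
    schema = {"events": {"user_id": "INT", "ts": "TIMESTAMP", "payload": "VARCHAR"}}

    sql = "SELECT user_id FROM (SELECT * FROM events) AS e WHERE user_id = 42"

    # optimize() applies rule-based rewrites such as predicate and projection
    # pushdown; no cardinality estimates are needed for these.
    optimized = optimize(sqlglot.parse_one(sql), schema=schema)
    print(optimized.sql(pretty=True))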
Hey, great work! Can you talk a bit about the use case that inspired you to write the tool?
At the large tech companies I've been working at, there are many different big data engines that speak different dialects of SQL (Presto / Spark). People write SQL queries in one dialect and want to run them in another, but it doesn't just work: many parts of the query need to be manually changed for it to run, which is tedious and error-prone.
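Concretely, that dialect-to-dialect rewrite is what transpile does (a sketch; the query below is just an illustrative example):

    import sqlglot

    # A Presto-flavored query rewritten for Spark SQL.
    print(sqlglot.transpile(
        "SELECT APPROX_DISTINCT(user_id) FROM events",
        read="presto",
        write="spark",
    )[0])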
Got it. I'm asking because I worked on something similar. The idea was to unit test some Airflow workflows locally. The production workflows were using Hive, but having a local Hive container was too slow for tests, so we wrote a small parser to translate the Hive queries into SQLite queries at runtime. In the end, we had a decent PoC but couldn't complete it because of all the Hive features, but it was super fun.
yep, we do the same but using duckdb; it's got a lot more analytical functions and is closer to hive than sqlite
Where I work, we handle it by having Data Scientists running experiments on our platform commit their queries as Python code to a repository of metrics.
I actually built the system you work on :). It was the main inspiration for this project because PyPika is not a great experience for data scientists.
Thought you might catch that! I've actually helped swap a few things to SQLGlot from pypika.
SQLGlot is great. We've used it to extend our FOSS probabilistic data linking library[1] so that it is now capable of executing against a variety of SQL backends (Spark, Presto, DuckDB, Sqlite), significantly widening our potential user base.
We implement the core statistical model in SQL, and then use SQLGlot to transpile to the target execution engine. One big motivation was to futureproof our work - we're no longer tied down to Spark, and so when the 'next big thing' (GPU accelerated SQL for analytics?) comes along, it should be relatively straightforward to support it by writing another adaptor.
Working on this has highlighted some of the really tricky problems associated with translating between SQL engines, and we haven't hit any major problems, so kudos to the author!
[1] https://github.com/moj-analytical-services/splink/tree/splin...
Any particular reason you chose to use a sql transpiler rather than using e.g. sqlalchemy to generate sql directly?
Good question.
First, the SQL involves complex analytical queries on large datasets that need careful pipelining, caching and optimisation. I wasn't sure the extent to which this was possible in sqlalchemy.
Second, it was important that our implementation of the EM algorithm (an iterative numerical approach for maximising a likelihood function) was readable/understandable, and I felt that readers of the code were more likely to know SQL than sqlalchemy. Certainly I was more comfortable expressing it in SQL than in another (i.e. sqlalchemy's) API.
Third, our API allows the user to inject SQL to customise their data linking models and it felt more natural for this to be directly executed rather than go through an abstraction layer.
I'm not a sqlalchemy expert, but my sense is that it's more appropriate for transactional/atomic SQL than for complex analytical queries.
Well, you're not wrong per se; sqlalchemy is primarily an ORM system. However, one of its components is a module purely for SQL generation, which you could (not saying you should) use to generate (complex) SQL queries. Obviously their main concern is querying and updating rows, but I think pretty much all common SQL constructs are supported (and if not, it is fairly easy to add your own).
I can't say that I ever had to implement an EM algorithm that way though, so I can't say how complicated that would be. Certainly it'll take some more effort than writing it in a familiar SQL dialect. The main reason to do it would be that sqlalchemy has pretty good support for quite a lot of databases.
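For reference, the kind of standalone SQL generation that module offers looks roughly like this (a sketch using the SQLAlchemy Core 1.4-style API; the table and columns are made up):

    from sqlalchemy import column, func, select, table

    # Pure SQL generation with SQLAlchemy Core: no engine or ORM session needed.
    events = table("events", column("user_id"), column("price"))

    stmt = (
        select(events.c.user_id, func.sum(events.c.price).label("total"))
        .where(events.c.price > 0)
        .group_by(events.c.user_id)
    )

    # str(stmt) renders generic SQL; stmt.compile(dialect=...) targets a specific backend.
    print(stmt)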
Neat! I did an exploration of sql parsers in different languages [0] and couldn't find much for python. But between this project itself and the couple it lists in the benchmarks I have a few more to look at.
[0] https://datastation.multiprocess.io/blog/2022-04-11-sql-pars...
It reminds me of what CompilerWorks is doing. I didn't test it though, so I can't judge the similarities.
https://www.compilerworks.com/technology/
Awesome, let me know if you have any questions or suggestions if you check it out!
You were a bit too quick in dismissing sqlparse. I had the same initial reaction, but it does actually build a tree, more or less, by grouping tokens together. It doesn't build a full syntax tree, though. For instance, it doesn't attempt to identify the FROM or JOINs of a select statement, but it does make sure to group things such that a FROM keyword following a SELECT is part of the same SELECT statement. This makes it flexible but does leave you with slightly more work (though it worked well enough for my purposes).
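For anyone curious, that token grouping looks roughly like this (a sketch; the exact token classes printed may vary between sqlparse versions):

    import sqlparse

    statement = sqlparse.parse("SELECT a, b FROM t JOIN u ON t.id = u.id WHERE a = 1")[0]

    # sqlparse groups tokens (identifier lists, comparisons, the WHERE clause, ...)
    # but does not build a full syntax tree with explicit FROM/JOIN structure.
    for token in statement.tokens:
        if not token.is_whitespace:
            print(type(token).__name__, repr(str(token)))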
Could this be used in VSCode as a plugin to autoformat/lint my .sql files?
nice one. do you feel that having it in pure python leaves some performance on the table? or is performance not so critical in this use case?
performance is not that important; a SQL query usually takes anywhere from 100 ms to an hour or more to run, so it doesn't matter that parsing it takes 0.5 ms.