I appreciate how wonderfully simple and dependency-free this is. More often than not people just want to write a compiler of sorts without bison or yacc, or LLVM, and just convert expressions into assembly or VM-like instructions that _just run_. This is a great starting point for that, and I wish I had something like it 10 years ago.
(Crenshaw's Let's Build a Compiler is an excellent source too if you want to go one step lower and go standard library free, focusing only on function call stacks: https://compilers.iecc.com/crenshaw/)
I'd argue that you should start by writing an interpreter and push it as far as you can.
But once you're ready to generate code, use LLVM or something like it, because otherwise you are going to hit very problematic roadblocks early in your effort that will kill your interest in compilers.
Some examples are: ensuring you can compare expressions for equality, ensuring that changed expressions don't violate dominance, SSA conversion, avoiding infinite loops when analyzing CFGs. If you don't start with knowledge of several things like this, you are looking at rewriting your compiler a dozen times. That is a great way to learn about compilers, but perhaps not on the first compiler you make.
That's right. Crenshaw's LBAC can make you an interpreter too. LBAC's magic lies in single-token read-ahead, single-character identifiers, and single-character digits. It parses its BNF grammar with recursive function calls alone (no stacks, no tables for operator precedence, etc.). The final language you build is almost useless, but it is done without a standard library (e.g., resizable strings, resizable arrays, stacks, queues, associative containers), and serves as an introduction to get one hooked on more advanced compiler theory like SSA. I love OP's work. I consider OP's implementation (and similar ones) to be a modern-day LBAC. It's just that LBAC is one of those things I'd take to the grave. I have never found such a concise introductory text that produces a working final example in such delightful piecewise prose.
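To give the flavor, here is the Crenshaw trick in miniature, transplanted to Python (a sketch, not Crenshaw's actual Pascal): one character of lookahead, single-digit numbers, and instructions emitted directly while parsing, with no AST and no tables.

    import sys

    class Cradle:
        # One char of lookahead, single-character digits, code emitted
        # on the fly: nothing but recursive function calls.
        def __init__(self, src):
            self.src, self.pos = src, 0
            self.look = src[0] if src else ''

        def advance(self):
            self.pos += 1
            self.look = self.src[self.pos] if self.pos < len(self.src) else ''

        def match(self, c):
            if self.look != c:
                sys.exit(f"expected {c!r}, got {self.look!r}")
            self.advance()

        def factor(self):            # factor := digit | '(' expression ')'
            if self.look == '(':
                self.match('(')
                self.expression()
                self.match(')')
            else:
                print(f"  mov eax, {self.look}")
                self.advance()

        def expression(self):        # expression := factor (('+'|'-') factor)*
            self.factor()
            while self.look in ('+', '-'):
                op = self.look
                self.advance()
                print("  push eax")
                self.factor()
                print("  pop ebx")
                print("  add eax, ebx" if op == '+' else "  sub eax, ebx\n  neg eax")

    Cradle("1+(2-3)").expression()   # prints the x86-ish instruction stream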
Shameless plug for another relatively small compiler written in Python with no dependencies: http://cwerg.org
But if anyone otherwise wants to try Yacc/Lex, I have an online interpreter/editor with more than 250 examples to try/study at https://mingodad.github.io/parsertl-playground/playground/ and I just included a grammar for tinycompiler there (select "Tinycompiler parser" from "Examples", then click "Parse" to see a parse tree for the content in "Input source").
lovely from-scratch implementation, without the magic and complexity of lex and yacc. brings back memories from 30+ years ago when I implemented a subset of C in C using lex and yacc in a compiler course.
in your blog, i would have liked to see the BNF grammar (without having to peek into the code).
now, here's your next assignment :) - do the same for functional programming (lambda calc) or logic programming (prolog). That would round out the education on languages and compilers for computation.
Not to take away from the author or you, but we had things like this 10 years ago https://github.com/ymyzk/tinyc/tree/master/tinyc
Unless I am mistaken, this uses LLVM?
I'm the author of ymyzk/tinyc and was surprised to see my project mentioned here after more than 10 years. Since the README is written in Japanese, let me answer in English here. The project has two backends: one for NASM (x86) and another for LLVM IR.
Only a happy customer, but if you want an amazing experience building a compiler I highly recommend David Beazley's Compiler Course (or anything else you can sign up for) https://www.dabeaz.com/
While I loved this week-long course, I was a bit disappointed that we offloaded all of the code generation to LLVM. I think it would be fantastic to stretch this into a two-week course and actually emit assembly.
When I took the course, each person did their own thing when it came to code generation. I don't recall anyone using LLVM. I think emitting assembly is easier than using LLVM.
I would love a a longer course.
The fire.wend code[1] made me smile! I love how one pushes the boundaries of a language that can barely do anything, just to prove to oneself that it works.
Code is 10/10. Go for merge! :-)
[1] https://github.com/ssloy/tinycompiler/blob/main/test-program...
Funnily enough, wend looks like what fun programming means to me: C-like syntax (prefixed types), strongly typed but without the heartaches of pointers, and simple types.
Every feature on top of that brings either leaky abstractions or nightmare scenarios (or both).
That said I'm not claiming the world should run on this type of code. Or should it.
I think OP is just trying to demystify the complexity of compilers for the sake of an educational resource. It's high-quality work, compressed to a very small line count.
Well, Python and Pascal were designed as a teaching language, and ended up being remarkably good for actual programming.
I didn't know this regarding Pascal, do you have any source ref on this?
From wikipedia:
"Pascal was influenced by the ALGOL W efforts, with the explicit goals of teaching programming in a structured fashion and for the development of system software.[5] A generation of students used Pascal as an introductory language in undergraduate courses."
Its grammar made it relatively easy to write compilers for the language, which could be done as undergraduate exercises. Of course, this did not make it so popular with professional programmers - see http://eprg.org/computerphile/pascal.pdf by Brian Kernighan, although extended Pascals such as Delphi were very productive in the Windows environment.
I have the greatest respect for Prof. Kernighan, but history hasn't been kind to C's unbounded arrays and strings, however convenient they may be for the programmer. Moreover, just two years after his critique, Turbo Pascal would come out with an environment that is still revered as a pioneering and exceptionally productive IDE. It outsold C compilers by multiple orders of magnitude in the mid-1980s. (And that's ignoring unlicensed copies.)
Turbo Pascal's successor, Delphi, was a successful rapid application development environment, and still being used today. Pascal was used extensively at Apple for the Apple III and Lisa, and was also the original development environment for Mac apps.[1]
But he was probably right about Ada, which has survived and may even be enjoying something of a renaissance as a fix for the sins of C (and C++) that isn't as alienating as Rust.
[1] https://archive.org/details/bitsavers_applelisapriefHistoryo...
His critique was already outdated by 1978, because of Modula-2, Pascal's successor, explicitly designed for systems programming. After Niklaus Wirth's sabbatical year at Xerox PARC, where he learned about Mesa, the Xerox Star written in Mesa, and the IDE-like experience of Tajo and XDE, he came back to ETHZ and created Modula-2, the Lilith personal workstation, and the related OS, written in Modula-2.
Modula-2, designed in 1978 for memory-safe systems programming as Pascal's successor, had support for unbounded arrays and strings.
By the way, GCC nowadays has GNU Modula-2 integrated in the official set of supported languages.
Turbo Pascal wasn't just a good tool, it was a pioneer in how it was marketed and priced. It was relatively affordable, only $49, at a time when tools generally cost in the hundreds. And they ran monthly ads in nearly every computer magazine. If you wanted to hack on things on your PC, and you didn't already work for a company that would buy you the tools, you really had two choices: BASIC and Turbo Pascal.
I used to be a professional Delphi programmer, but I would not have described Delphi as "extremely successful", partially due to its mismanagement by the various owners of the IP. It's noticeable that Delphi's inventor went on to far greater success with a C-like language - C#.
Edited. Fortunately C# (like Java) fixes what Prof. Kernighan praised about C - unbounded pointers/arrays/strings. At the expense of unpredictable performance due to garbage collection. C# and Java (and JavaScript) retain C-like syntax, but their underpinnings/VMs were heavily influenced by Self and Smalltalk. (Note Pascal also pioneered bytecode compilers.)
It surely was all over the place in Europe, and still has some spots, which is why in Germany you can keep coming to yearly Delphi conferences.
https://entwickler-konferenz.de/program-en/
I really, really dislike "Why Pascal is Not My Favorite Programming Language", especially when you have a horse in the race as he did.
v nice thanks. I think Pascal was the first language we were taught in school, just before C.
Also Basic, wasn’t it?
I'm aware people here frown upon paid content but I am currently taking this course and it's excellent:
https://pikuma.com/courses/create-a-programming-language-com...
Hey there. Thanks for the mention and for the kind words.
Thanks! I appreciate your discussion of semantic analysis for wend. I've literally just embarked on creating a (yet another) "compiles to JSX" language[0][1]. I'm using ANTLR for my lexer/parser, but I've just hit a point where I can start doing semantic analysis :)
[0]: https://chicory-lang.github.io/
[1]: https://github.com/chicory-lang/chicory
Great project, this is how language exploration should start, and then carry it from there.
I've been a big fan of the author's tiny renderer. Nice to see a tiny compiler too!
That sounds interesting! He seems to have four tiny renderers pinned on his GitHub page; is https://github.com/ssloy/tinyrenderer the one you're recommending? What do you like about it?
Pretty neat. I would have expected it to take a bit longer, but I suppose the optimizing step is the real magic in a compiler.
I don't think one can understand compilers in a "week-end".
The "week-end" part refers to the fact that I wrote it in one week-end with little to no experience in compiler design. You can check the commit history, 12 to 14 Jan, 2024 :)
I think the "week-end" reference up above was a little tongue in cheek, since the modern spelling is a compound word: https://linguix.com/english/common-mistake/week_end_hyphen
Not to take away from this awesome sharing.
Compilers are some of the simplest "complicated" programs out there if you start by throwing yacc/bison/antlr/parser generators in the garbage.
Production compilers are complicated because of the feature set of languages and performance requirements.
You can write a lexer + parser + treewalk interpreter for a simple language in a day if you know what you are doing.
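To illustrate the scale involved, the treewalk core of such an interpreter can be this small once you have an AST (a sketch for a toy expression language, with tuples standing in for AST nodes and `env` mapping variable names to values):

    def eval_expr(node, env):
        if isinstance(node, tuple):          # ('+', lhs, rhs)
            op, lhs, rhs = node
            a, b = eval_expr(lhs, env), eval_expr(rhs, env)
            return {'+': a + b, '-': a - b, '*': a * b}[op]
        return env.get(node, node)           # variable name or literal

    assert eval_expr(('+', 'x', ('*', 2, 3)), {'x': 1}) == 7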
The hardest part (IMHO) about getting a simple language running, particularly for beginners, is the “expression”. Mostly for your typical algebraic style infix expressions with precedence.
Just getting the grammar straight on the naturally recursive structures can be a trick. They also touch a large portion of the code generation and run time.
Get the classic layered expression grammar working and you're 80% there.
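One common formulation, with one rule per precedence level (a sketch; any textbook variant works):

    expression := term (("+" | "-") term)*
    term       := factor (("*" | "/") factor)*
    factor     := NUMBER | "(" expression ")"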
> The hardest part (IMHO) about getting a simple language running, particularly for beginners, is the “expression”
This is why teaching material should be tailored towards teaching rather than implementing demented stuff from the real world just because it is "there." You can easily implement a parser that ignores precedence and tell the reader that we are using brackets to force precedence instead.
In a real-world parser, you might use something like this[1] or some other home-grown algorithm if you didn't know about it. It doesn't matter as long as the parser works.
[1] Top-Down operator precedence (Pratt) parsing https://eli.thegreenplace.net/2010/01/02/top-down-operator-p...
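The core of a Pratt parser really is tiny; a minimal sketch in Python, assuming a token list of ints and operator/paren strings:

    # Minimal Pratt parsing: binding powers drive the recursion depth.
    BP = {'+': 10, '-': 10, '*': 20, '/': 20}

    def parse(tokens, min_bp=0):
        tok = tokens.pop(0)
        if tok == '(':
            left = parse(tokens, 0)
            assert tokens.pop(0) == ')'
        else:
            left = ('num', tok)
        while tokens and BP.get(tokens[0], -1) >= min_bp:
            op = tokens.pop(0)
            right = parse(tokens, BP[op] + 1)   # +1 makes it left-associative
            left = (op, left, right)
        return left

    assert parse([1, '+', 2, '*', 3]) == ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))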
Is that really the hard part? I figured that part out before I knew much programming: made a simple Excel type formula parser. (A horrible, horrible solution based around replacing strings and manually counting parentheses... but it worked!)
I never got much farther than that though, I've looked into making a compiler many times and got overwhelmed every time.
Good to know I wasn't alone in writing terrible expression parsers. Since I knew absolutely nothing about parsing, my parser/interpreter consisted of splitting on whitespace, followed by linear scanning to find the highest-precedence operation, repeated until a single token remains (basically how a human would do it).
It's hard if:
1. You don't recognise the recursive nature of the problem,
Or
2. You don't recognise the stack-based nature in the problem (shunting yard algo, etc).
IOW, you did it the hard way :-).
Pratt parsers are a really nice way to handle mathematical expressions like this vs. the usual recursive descent.
https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-...
well, in my personal experience, parsing is the easiest part. it's what you do with the AST once you have one that is the hard part. but for some reason an overwhelming amount of writing is about frontends when you could teach the concepts to a layman with some boxes and arrows.
I think the hardest part is function calls. Implementing the standard calling convention is really fiddly and annoying when you have to save and restore registers around function calls if you want to do it efficiently (keeping track of which registers are dirty, doing parallel loads of registers into the right argument registers, spilling when appropriate, etc.). It is fiddly.
The alternative is going to an IR and doing full register allocation but then you need to implement Lengauer-Tarjan to get into SSA, all the same parallel load stuff for phi functions/block arguments, out-of-SSA, reconstruct in optimisation passes all the information you discard by going to a linear IR, etc.
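To make the fiddliness concrete, here is a deliberately naive Python sketch of emitting a System V x86-64 call (the register names are the real ABI's; the `dirty` set of live registers is hypothetical codegen state, and stack alignment is ignored):

    CALLER_SAVED = ['rax', 'rcx', 'rdx', 'rsi', 'rdi', 'r8', 'r9', 'r10', 'r11']
    ARG_REGS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']

    def emit_call(fn, arg_sources, dirty, emit=print):
        live = [r for r in CALLER_SAVED if r in dirty]
        for r in live:                             # spill live caller-saved regs
            emit(f"  push {r}")
        for reg, src in zip(ARG_REGS, arg_sources):
            # Naive sequential moves: if an argument's source is itself an
            # argument register that was already overwritten, it is silently
            # clobbered -- exactly the "parallel load" problem.
            emit(f"  mov {reg}, {src}")
        emit(f"  call {fn}")
        for r in reversed(live):                   # restore in reverse order
            emit(f"  pop {r}")

    emit_call("f", ["rsi", "rdi"], dirty={"rbx", "r10"})  # 2nd mov reads a dead rdi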
I'd argue that Lengauer-Tarjan is overrated. While it is neat, it's premature optimization IMHO. Fully maximal SSA (at the entry of each block insert a phi-node for each variable) + dead code elimination (we need it anyways) is much simpler.
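A sketch of what that can look like, assuming hypothetical block/instruction objects with `label`, `preds`, `instrs`, `uses`, and `dest` fields: every block gets a phi per variable, and both passes are purely local, with no dominance information anywhere.

    def maximal_ssa(blocks, variables):
        for b in blocks:                       # pass 1: phi names + local renaming
            b.entry = {v: f"{v}@{b.label}" for v in variables}
            cur, n = dict(b.entry), 0
            for ins in b.instrs:
                ins.uses = [cur.get(u, u) for u in ins.uses]
                if ins.dest in variables:
                    n += 1
                    new = f"{ins.dest}@{b.label}.{n}"
                    cur[ins.dest] = new        # keyed by the original variable
                    ins.dest = new
            b.exit = cur                       # value of each var leaving the block
        for b in blocks:                       # pass 2: fill in phi arguments
            b.phis = [(b.entry[v], [(p.label, p.exit[v]) for p in b.preds])
                      for v in variables]
        # a later DCE pass then deletes the (many) redundant and dead phis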
Typical implementations of Lengauer-Tarjan are often taken verbatim from Andrew Appel's book and involve higher constant factors than alternative algorithms - such as Cooper et al's "engineered" version of the usual fixpoint algorithm, which is very neat, simple, and performs well.
If you are saying that constructing SSA following the classical approach is a premature optimisation, perhaps you would prefer Braun et al's retelling of Cliff Click's SSA construction algorithm - which works backwards from uses to place phis and requires no dominance information to be computed.
Thank you for the reference, I'll check it.
Why not fully maximal SSA + DCE? I mean, other than timings consideration. Fully maximal does not even need any graph traversal, it is entirely local to each block.
I think it will introduce too many redundant phis, but I've never used it in practice - so I can only speculate. I'm not convinced DCE will clean maximal SSA up substantially. Even the classic Cytron et al algorithm must be combined with liveness analysis to avoid placing dead phis (that is, it does not produce pruned SSA by default). In the past, there was always a fear that SSA has the potential to cause quadratic blowup in the number of variables - this concern is mostly theoretical but probably influenced some of the design decision around algorithms for constructing SSA.
Braun et al's algorithm works backwards from uses (which generate liveness), so you get pruned SSA out. In the case of reducible control flow graphs, you also get minimal SSA. This is all without any liveness or dominance computation beforehand. Granted, you may want those things later, but it's nice that you can construct a decent quality SSA with a fairly intuitive algorithm. Also shown in the paper is that you can incorporate a few light optimisations during SSA construction (constant folding, for example).
I'll definitely test maximal SSA + DCE in tinyoptimizer when I get to it. For the moment I made a somewhat-pruned mem2reg pass [1]; I'm not even sure what to call it. But it did indeed require computing dominance frontiers.
[1] https://ssloy.github.io/tinyoptimizer/mem2reg/
Nice.
I'm always happy to see more accessible resources for compiler writers.
---
As an aside: for displaying CFGs on the page, it would be very interesting to emit something somewhat dynamic. SVGs are always a good start, but there is a neat library for doing hierarchical graph layout (dagre, with dagre-d3 handling rendering as well). In my own efforts at pedagogy, I've been interested in producing CFGs in-browser whose basic blocks comprise a "unified diff" view of the block (this being achieved in realtime by maintaining a subset of LLVM whose CFGs are persistent). Then it is more obvious what has changed: at least in the case of mem2reg which shouldn't introduce new blocks or move too much around (I forget if it hoists allocas to the entry block or not).
It'd also be cool to distil what underlying ideas you have found to be most useful in your efforts. The constrained scope of them may be useful to me, as I've wanted to create a kind of "advent of compilers" for years (advent of code but with a heavy slant towards compiler tasks/algorithms).
Thank you!
> lexer + parser + treewalk interpreter
If all you do is construct an AST and interpret it, it's not really a compiler, is it? It doesn't compile from source to target. At most you're describing a compiler front end (arguably), or an interpreter.
I would expect:

lexer + parser + AST definition and construction + semantic analysis/type checking + X + codegen to ASM/WASM/C

where X includes

- definition of an intermediate representation (IR)
- lowering AST to IR
- static analysis
- improvers/optimisation passes (at least some simple stuff)
- code gen including register allocation

This can still be a simple "complicated" program, but there's more to a compiler than an AST interpreter.

EDIT: I notice that the author of the original article has also started work on an optimizing compiler: https://ssloy.github.io/tinyoptimizer/
> If all you do is construct an AST and interpret it it's not really a compiler is it.
What if I dumped the AST to disk, called it "bytecode" and updated the "interpreter" to run this "bytecode"? Would you consider that to be a compiler? :-)
> where X includes
I don't consider any of these things to be essential. There are entire languages out there which run on interpreters of some kind. There are others which compile down to C/JS. Many compilers use LLVM as their backend. By your definition, none of them use (or are) compilers.
> Would you consider that to be a compiler? :-)
:) No. Under my definition there needs to be some non-trivial transformation into or out of an intermediate and/or target representation (either syntax-directed directly out of the parser, or from a materialized AST). Personally I would argue that even if you trivially "compiled" to sequential bytecode, if the bytecode is then interpreted it is hard to argue that you have created a compiler (the pre-JIT cpython interpreter is still an interpreter, even though it includes a bytecode representation). But I can see that this point can be argued, and historically a school exercise of writing a Pascal-to-pcode converter would be called a compiler, so sure, you can take that perspective if you like. Just don't get confused about whether you are learning to build a compiler (definition 1) or a compiler (definition 2).
I have written all kinds of compilers, interpreters and vms over the last 15 years. A couple of them for work. The rest for fun. Some vms used RC, others did mark-and-sweep GC. Some vms even did JIT. A lot of them were simple treewalk interpreters because the point was to play with the syntax of the language.
Based on that experience, I would say your definition is more academic than real-world, as javac (or any compiler targeting the JVM) is not a real compiler according to it.
> javac (or any compiler targeting the JVM) is not a real compiler according to it
I think it depends on how you interpret my "non-trivial transformation into or out of an intermediate and/or target representation". For example if javac involves IR, SSA based analysis and transformation (which I assume it does) then that would be a non-trivial transformation and I'd call it a compiler. If on the other hand it was a direct syntax-directed transformation to java byte code then I'd call it a translator not a compiler. But I'm not claiming any authority over definitions here. Words aren't my strong suit.
> if javac involves IR, SSA based analysis and transformation (which I assume it does)
https://github.com/openjdk/jdk/blob/master/src/jdk.compiler/...
I am fairly certain that it does not. The JVM is a stack-machine.
I wrote a compiler for a simple programming language that compiled down to JVM bytecode for a project in the early 2010s. Its output for test programs was as fast as the comparable Java versions because they were fairly similar at the bytecode level.
The HotSpot JVM is where all the optimization effort is applied.
I believe there’s some trivial optimization done by the frontend (constant folding, simplifying string concatenation, converting switch statements to lookup tables) but I think they’d agree with you that javac doesn’t really fit their definition of a compiler and that the HotSpot JIT is the true compiler based on how they talked about the Python interpreter pre-JIT.
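That kind of frontend folding is tiny; a toy sketch on a tuples-as-AST representation (not javac's actual code):

    import operator
    OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

    def fold(node):
        if isinstance(node, tuple):                # ('+', lhs, rhs)
            op, lhs, rhs = node
            lhs, rhs = fold(lhs), fold(rhs)
            if isinstance(lhs, int) and isinstance(rhs, int):
                return OPS[op](lhs, rhs)           # evaluate at compile time
            return (op, lhs, rhs)
        return node                                # literal or name

    assert fold(('+', 'x', ('*', 2, 3))) == ('+', 'x', 6)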
javac does very, very few things. It intentionally emits byte code close to the source program. The HotSpot compiler in the JVM does the heavy lifting.
I strongly disagree with your definition. A naive syntax-directed translation to bytecode or machine code is still compilation. Lots of production compilers do exactly that eg Wirthian compilers.
Honestly, antlr made this pretty straightforward to me. I didn't want to work in java, but they have a ton of targets. You can definitely write it all yourself, and that's a great learning exercise. But I wanted to get a parser for a language idea I had in mind and it took a couple of days with antlr (https://github.com/chicory-lang/chicory)
Problem with these tools is, you have to spend time wrangling them instead of learning to write a lexer/parser yourself.
A recursive-descent parser is a beautiful thing and can be implemented very quickly.
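For example, the classic expression grammar comes out as one small function per rule (a sketch in Python):

    import re

    def tokenize(s):                  # numbers, parens, and + - * /
        return re.findall(r"\d+|[()+\-*/]", s)

    def parse_expr(toks):             # expr := term (("+"|"-") term)*
        node = parse_term(toks)
        while toks and toks[0] in ('+', '-'):
            node = (toks.pop(0), node, parse_term(toks))
        return node

    def parse_term(toks):             # term := factor (("*"|"/") factor)*
        node = parse_factor(toks)
        while toks and toks[0] in ('*', '/'):
            node = (toks.pop(0), node, parse_factor(toks))
        return node

    def parse_factor(toks):           # factor := NUMBER | "(" expr ")"
        tok = toks.pop(0)
        if tok == '(':
            node = parse_expr(toks)
            assert toks.pop(0) == ')', "missing )"
            return node
        return int(tok)

    assert parse_expr(tokenize("1+2*(3-4)")) == ('+', 1, ('*', 2, ('-', 3, 4)))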
5 days if you don't, but have the right tutor.
If you know basic programming (js/python), a day is more than enough to get the concept across using a tiny language. Could probably be done in an hour.
The problem, always, is people are given terrible reading recommendations, which make the subject seem more complicated than it is.
I never took a compilers course. The first reading recommendation I got was The Dragon Book. Perhaps not the worst starting place at the time (mid 90s) but my memory of it was that the book was front-loaded with automata and parser theory. At the time I didn't really make it past the front end. It would be interesting to see a compiler tutorial that started at codegen and worked backwards (or started at the ends: lexing and codegen, and worked towards the middle in alternating chapters).
> It would be interesting to see a compiler tutorial that started at codegen and worked backwards (or started at the ends: lexing and codegen, and worked towards the middle in alternating chapters)
This is exactly how my university's compiler course did it[1]. It was really nice to start with assembly and go upwards. We had all the lexing/parsing stuff being discussed around the middle of the semester rather than being saddled with theory-heavy regex, automata theory, etc right at the beginning.
[1]: https://ilyasergey.net/CS4212/
The Dragon Book is terrible as an introduction. There are better books I would probably recommend, but not to a beginner. The best "complete" intro out there today is Nystrom's Crafting Interpreters.[1]
[1] https://www.craftinginterpreters.com/contents.html
> There are better books I would probably recommend
I'm curious what you'd recommend.
For what it's worth my goal was to compile to machine code. Anything less would have seemed insufficient. Later I got Appel's "Modern Compiler Implementation in Java" and Allen and Kennedy "Optimizing Compilers for Modern Architectures".
Cooper and Torczon "Engineering a Compiler" was recommended here recently. I haven't seen it.
There is a whole body of non-academic work aimed at practitioners for implementing compilers.
Nils Holm's work https://t3x.org/
Teaching and Learning Compilers Incrementally - Jeremy Siek - RacketCon 2023 https://www.youtube.com/watch?v=43VA_QaTRT8
Nanopass https://nanopass.org/
original paper that was the basis for nanopass http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
https://www.reddit.com/r/ProgrammingLanguages/comments/gnzra...
The Crafting Interpreters and the Thorsten Ball books
https://craftinginterpreters.com/contents.html
https://interpreterbook.com/ https://compilerbook.com/
https://www.phind.com/search/cm7efcpv000002e6hqdbfojex
I don't think that the Ghuloum 2006 paper was the initial basis for nanopass; the paper "A Nanopass Framework for Compiler Education" seems to predate it, from 2004: https://dl.acm.org/doi/10.1145/1016848.1016878
I got started from the books that were available here in the early 2000s. The one I like the most is "Programming Language Pragmatics" by Michael Scott, mainly for its simplicity. Then there is "Advanced Compiler Design and Implementation" by Steven Muchnick. I have Allen-Kennedy's book too, but I think it went over my head when I went through it. So I kept it aside.
"Engineering a Compiler" is quite approachable.
But, once you know the basics, you get the best bang-for-the-buck by looking at source code of various compilers and virtual machines.
All a compiler does is take a well-defined data structure and transform it into another, sometimes by encoding and sometimes by decoding. Really it's bread-and-butter stuff that every engineer does every day, but with the benefit that you can say something is undefined/out of scope far more easily than when talking to APIs you don't control.
The fun part is the magic algorithms you sometimes get to use.
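In miniature, that "transform one structure into another" view looks like this (a toy sketch: expression AST in, stack-machine instructions out):

    def compile_expr(node, code=None):
        code = [] if code is None else code
        if isinstance(node, tuple):            # ('+', lhs, rhs)
            op, lhs, rhs = node
            compile_expr(lhs, code)
            compile_expr(rhs, code)
            code.append(('BINOP', op))
        else:
            code.append(('PUSH', node))        # literal
        return code

    def run(code):                             # the matching tiny stack VM
        stack = []
        for ins, arg in code:
            if ins == 'PUSH':
                stack.append(arg)
            else:
                rhs, lhs = stack.pop(), stack.pop()
                stack.append({'+': lhs + rhs, '-': lhs - rhs, '*': lhs * rhs}[arg])
        return stack[0]

    assert run(compile_expr(('+', 1, ('*', 2, 3)))) == 7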
> All a compiler does is take a well-defined data structure and transform it into another
Sorry, having worked on language semantics topics, "well-defined" is not how I would describe either a programming language or the behavior of machine code.
There should be a well-defined list of undefined behaviors.
One weekend is enough to survey the general concepts. That’s extremely useful on its own, as now you have a better idea of how to orient yourself.
You set your own bar. You could understand "how a compiler works" in a day, but how much do you want to know?
Much the same as cproc+QBE, I guess: 70% of gcc's speed in my benchmarks.
The more real-life alternatives to those abominations of gcc and clang, the merrier.
I wish the Linux kernel devs cared to keep the door _reasonably_ open for such compilers (with assembler source files as an alternative to inline assembly, avoidance of some extensions, and of niche expensive compiler features).
This is another case of why computer languages with super complex syntax (C++, Rust, etc.) should be avoided like hell.
QBE is ... problematic. The source has no comments and single-letter variable names. It is not written to be extended by anyone other than the author.
I would recommend libfirm over QBE by at least 10km.
https://libfirm.github.io/ https://github.com/libfirm/libfirm
It won't surprise you to learn that I'm always interested in hearing about alternatives to QBE, but from the documentation, libfirm sounds more like a heavyweight alternative to LLVM, not a lightweight alternative to QBE. How big is it?
9x larger in LoC, 50x larger in comments.
I had a wall of supporting tokei output but decided against it; the short of it is, LLVM is phenomenally massive, and libfirm, at 135 kloc of code and 35 kloc of comments, is but a raindrop next to the 15+ Mloc that is LLVM.
Perhaps instead of looking towards libfirm, we should look at cranelift. It is of comparable size and actually includes 12kloc of documentation, not just comments.
https://github.com/bytecodealliance/wasmtime/tree/main/crane...
If it's roughly an order of magnitude more complexity than QBE, it doesn't sound like an alternative, even if it's still two orders of magnitude smaller than LLVM.
Cranelift has been looking interesting. Also, LuaJIT.
We agree on a lot of things, but QBE is off the table. Complexity and kloc are loosely correlated. Cranelift and libfirm are of equivalent size while Cranelift is in Rust and has wonderful compiler engineers working on it.
https://c9x.me/git/qbe.git/tree/rega.c
QBE is a one-person performance art project.
This code looks a lot like the Unix V6 kernel code or the source code to Steve Johnson's pcc C compiler. It's indeed too sparsely commented, but it's not as bad as you made it sound.
... and it is getting 70% of gcc speed...
On the topic of QBE, I've always felt that someone aiming to do the same scope of project ought to make their IR a syntactic subset of LLVM IR. If you do that, your test suite can involve invoking LLVM for a comparison.
As for QBE itself, many of the core transformations are fairly standard, which makes it somewhat more approachable for me (e.g. Cooper et al's dominance algorithm, Rideau et al's parallel moves algorithm, etc.). Of course, this doesn't negate the fact that it's not as "hackable" as they probably intended.
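A sketch of that testing idea in Python: `lli` is LLVM's IR interpreter, while `my_backend_run` is a hypothetical stand-in for compiling and executing the same .ll file with the toy backend under test.

    import subprocess

    def lli_run(ll_file):
        res = subprocess.run(["lli", ll_file], capture_output=True, text=True)
        return res.returncode, res.stdout

    def check(ll_file, my_backend_run):
        # Because the toy IR is a syntactic subset of LLVM IR, the same
        # file is valid input to both toolchains, so LLVM is the oracle.
        expected = lli_run(ll_file)
        actual = my_backend_run(ll_file)
        assert actual == expected, f"{ll_file}: got {actual}, want {expected}"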
Missing RISC-V code generation.
But it is written in plain and simple C99, so it is at least much less toxic than LLVM.
I wonder how cparser (I did not check if it is plain and simple C) + libfirm compares in performance to cproc+QBE on my benchmarks. I may have to take some time to check that.
Whatever the results, it is always good to have, again, a real-life alternative for optimizing C toolchains.
The main issues are the heavy usage of gcc extensions by linux (and glibc, etc).
Haven't used it, but this looks rather complete https://github.com/libfirm/libfirm/tree/master/ir/be/riscv
Oh... What I read did not mention that many backends. Don't read the doc, read the code... :(
Definitely worth some benchmarks, provided the libfirm SDK is reasonable: cmake (C++) is a bad start, and I have no idea if it requires expensive dependencies (even if those are written in plain and simple C).
The Circle frontend, which includes all of C++17 (when it was current) plus the Rust-related extensions, was implemented by a single guy.
Relativity was discovered by a single guy.
Which confirms that complex matters aren't an issue; being able to handle them might be.
Wrong, point missed by miles.
Maybe thinking about "normal"/"average" devs would help you to get the point.
Those can keep up with Go, as per the language authors' point of view regarding the target audience.
AI????
Not really. Relativity goes back to Galileo at least, and its development by Einstein relied on Lorentz and Riemann among others. Science really is a process of /standing on the shoulders of giants/, as Newton is reputed to have said.
That doesn't diminish Einstein's achievement of course.
I made a facetious comment to parry that of pjmlp, who also made a facetious comment w.r.t. compiler complexity and what "one person" can do.
https://github.com/seanbaxter/circle
It is as if one tried to claim that Fabrice Bellard was an example of a median dev.
The number of C++ compilers written by one person is one. And because of this, it gives us no predictive power about what pjmlp was trying to assert.
I agree with you, Albert and Lorentz and Riemann got caught in the rhetorical crossfire.
Not exactly the same, but here https://github.com/robertoraggi/cplusplus there is a serious one-person effort to create a C++23 compiler front end.
It does look serious, and it is able to parse one simple case I have at hand that OpenWatcom and OrangeC fail on; thanks for bringing it up.
You mean Circle can output correct intermediate code for the latest gcc code? Or valve mesa aco code?
That frontend plugs into LLVM for the IR to machine code part, no need for GCC.
It replaces clang, it is not a fork of it.
I thought it was using libfirm... not that C++ diarrhea of LLVM... :(
Interesting off-topic: I noticed this paragraph:
> Personally, I get a twisted pleasure wearing a T-shirt with this print in the 2020s, even though I hate being a walking billboard. But the number of angry comments from confused people makes up for any moral discomfort of advertising on my belly!
It's possible the 'angry comments' are from people who may have mistaken it for an antisemitic, white supremacist symbol[1].
[1]: https://www.adl.org/resources/hate-symbol/not-equal
Hum. I did not anticipate that, thank you for pointing it out.
Where is that paragraph?
I removed it, no need to offend people with a racial slur, especially when it is not intended.
What symbol was it talking about?