If you want 1 dependency rather than 0, there are various low-level libraries designed for building full-featured databases.
For one of these, RocksDB, one reference point is how CockroachDB was built on top of it for many years, through many successful Jepsen tests (until they transitioned to an in-house solution).
https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/
https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/
Another possibility is Apple's FoundationDB, with an interesting discussion here: https://news.ycombinator.com/item?id=37552085
A similar resource I recently discovered that isn't that popular: https://github.com/pingcap/talent-plan/tree/master/courses/r...
Writing a Bitcask-like (KV WAL) DB in Rust. Really cool and simple ideas. The white paper is only about 5 pages.
Bitcask is great, it's such an elegant idea. I've built a few different toy implementations in different languages and learned something new each time. YMMV based on how many deps you do or don't want to use and how complete you want to make it, but it's a totally doable small-ish project.
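To give a flavor of why it's such a doable project, here is a rough Go sketch of the core Bitcask idea (my own toy names; no CRCs, hint files, compaction, or crash recovery): values go into an append-only data file, and an in-memory "keydir" maps each key to the file offset of its latest value.

    package bitcask

    import (
        "encoding/binary"
        "io"
        "os"
    )

    // On-disk record layout: keyLen uint32 | valLen uint32 | key | value.
    type loc struct {
        off    int64
        keyLen uint32
        valLen uint32
    }

    type DB struct {
        f      *os.File
        keydir map[string]loc // latest location of each key
        end    int64          // append position
    }

    func Open(path string) (*DB, error) {
        f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
        if err != nil {
            return nil, err
        }
        // A real implementation would scan the file here to rebuild the
        // keydir and the append position.
        return &DB{f: f, keydir: map[string]loc{}}, nil
    }

    func (db *DB) Put(key, val []byte) error {
        hdr := make([]byte, 8)
        binary.LittleEndian.PutUint32(hdr[0:], uint32(len(key)))
        binary.LittleEndian.PutUint32(hdr[4:], uint32(len(val)))
        rec := append(append(hdr, key...), val...)
        if _, err := db.f.WriteAt(rec, db.end); err != nil {
            return err
        }
        // The keydir always points at the newest copy; old copies become
        // garbage for a later compaction pass.
        db.keydir[string(key)] = loc{off: db.end, keyLen: uint32(len(key)), valLen: uint32(len(val))}
        db.end += int64(len(rec))
        return nil
    }

    func (db *DB) Get(key []byte) ([]byte, bool, error) {
        l, ok := db.keydir[string(key)]
        if !ok {
            return nil, false, nil
        }
        val := make([]byte, l.valLen)
        if _, err := db.f.ReadAt(val, l.off+8+int64(l.keyLen)); err != nil && err != io.EOF {
            return nil, false, err
        }
        return val, true, nil
    }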
I looked into this at one point when I was typing out entire codebases for didactic purposes: SQLite 3 was 120,000 lines of code, but SQLite 2 was 12,000.
So for a bit more effort you get a battle tested real world thing!
The proprietary test suite for SQLite3 is much much larger still. The battle-testedness comes in great part from that.
Is that where the 10x more lines came from? Writing more "testable" code?
Oh no, SQLite3 is a lot more featureful than SQLite2. The proprietary test suite is what makes SQLite3 so solid.
This makes me wonder. Is anyone practicing TDD with genAI/LLMs? If the true value is in the tests, might as well write those and have the AI slop be the codebase itself. TDD is often criticized for being slow. I'd seriously like to compare one vs the other today. I've also heard people find it challenging to get it to write good tests.
I'd sort of invert that and say it's better to use LLMs to just generate tons more test cases for the SQL DBs. Theoretically we could use LLMs to create hundreds of thousands (unlimited, really) of test cases for any SQL system, with which you could pretty much certify the entire SQL capability. Maybe such a standardized test suite already exists, but it was probably written by humans.
At that point, you'd get a ton more value from doing Property Testing (+ get up and running faster, at lower cost).
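To make "Property Testing" concrete, here's a small Go sketch using the standard testing/quick package (the Store interface and mapStore stand-in are my own, just so the example is self-contained): you state an invariant once and let the framework generate inputs, instead of hand-writing each case.

    package kvtest

    import (
        "testing"
        "testing/quick"
    )

    // Store is the public key-value API the property runs against.
    type Store interface {
        Put(key, val string)
        Get(key string) (string, bool)
    }

    // mapStore is a stand-in so the sketch compiles; in practice you'd point
    // the property at the real engine instead.
    type mapStore map[string]string

    func (m mapStore) Put(k, v string)             { m[k] = v }
    func (m mapStore) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }

    func TestStoreAgreesWithModel(t *testing.T) {
        // Property: after any sequence of puts, every key reads back the last
        // value written, exactly as a plain map (the "model") would.
        prop := func(keys, vals []string) bool {
            var s Store = mapStore{}
            model := map[string]string{}
            for i, k := range keys {
                v := ""
                if i < len(vals) {
                    v = vals[i]
                }
                s.Put(k, v)
                model[k] = v
            }
            for k, want := range model {
                if got, ok := s.Get(k); !ok || got != want {
                    return false
                }
            }
            return true
        }
        // quick.Check generates many random inputs and reports the first one
        // for which prop returns false.
        if err := quick.Check(prop, nil); err != nil {
            t.Error(err)
        }
    }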
If I had to have either the code or the tests generated by an LLM, I'd manually write the test cases with a well-thought-out API for whatever I'm testing, then have the LLM write the code that implements what I thought up, rather than the opposite, which sounds like a slow and painful death.
I hadn't heard of "Property Testing", if that's a term of art. I'll look into it. Anyway, yeah, in TDD it's hard to say which part deserves more human scrutiny: the tests or the implementations.
Are you sure that LLMs, because of their probabilistic nature, wouldn't bias against certain edge cases? Sure, LLMs can be used to great effect to write many tests for normal usage patterns, which is valuable for sure. But I'd still prefer my edge cases handled by humans where possible.
I'm not sure if LLMs would do better or worse at edge cases, but I agree humans WOULD need to study the edge case tests, like you said. Very good point. Interestingly, though, LLMs might help identify more edge cases that we humans didn't see.
TDD suffers from being inflexible when you don't fully understand the problem. Which in software is basically always.
Every time I've tried it for something, I make no progress at all compared to just banging out the shape that works and then writing tests to interrogate my own design.
Happy that it's not just me. I tried it a couple of times, and for small problems, I could make it work, albeit with refactorings both to the code and tests.
But for more complicated topics, I never fully grasped all the details before writing code, so my tests missed aspects and I had to refactor both code and tests.
I kinda like the idea more than the reality of TDD.
TDD is supposed to teach you that refactoring of both the code and the tests is "normal": in other words, get used to constant, smallish refactors, because that's what you should be doing.
Now, the issue with badly defined problems is not just that they're badly defined; it's also that we like to focus on the technical implementation specifics. Doing TDD from scratch requires a mindset shift to think about actual user value (what are you trying to achieve), and then go for the minimum from that perspective. It's basically the inverse of the common architecture approach, which is to design data models first and start implementing next. With TDD, you evolve your data models along with the code and architecture.
And it is freaking hard to stop yourself from thinking too far ahead and to instead let the tests drive your architecture (code structure and APIs). Which is why I also frequently prototype without TDD, and then massage those prototypes into fully testable code that could have been produced with TDD.
I think in general people tend to overdo TDD if they do it at all, aiming for 100% test coverage, which just ends up doing what you and the parent mention: solidifying a design and making it harder to change.
If instead every test is well intentioned and focuses on testing the public API of whatever you test, without making assumptions about the internal design, you can get well-tested code that is also easy to change (assuming the public interface is still OK).
It's extremely hard to really do TDD and get code that's hard to change. If you persevere with a design that's hard to change, every single change in your failing-test-fix-implementation TDD cycle will make you refactor all your tests, and you'll realise why the design is bad and reduce coupling instead.
What really happens is that people write code, write non-unit "unit" tests for 100% coverage, and then suffer because those non-unit tests are now dependent on more than just what you are trying to test, all of them have some duplication because of it, and any tiny change is now blocked by tests.
You can get 100% coverage by focusing on testing the public API too. These two things are completely orthogonal.
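To make the "test the public API, not the internals" point concrete, here's a small Go sketch (entirely hypothetical types, not from the book): the same black-box test runs unchanged against two different backing structures, so swapping the design later doesn't force a test rewrite.

    package apitest

    import "testing"

    // Index is the public API; the backing structure is deliberately invisible
    // to the test.
    type Index interface {
        Set(key string, val int)
        Lookup(key string) (int, bool)
    }

    // Two throwaway implementations, included only so the sketch is
    // self-contained; imagine one of them later becoming a B-tree.
    type mapIndex map[string]int

    func (m mapIndex) Set(k string, v int)         { m[k] = v }
    func (m mapIndex) Lookup(k string) (int, bool) { v, ok := m[k]; return v, ok }

    type sliceIndex struct {
        keys []string
        vals []int
    }

    func (s *sliceIndex) Set(k string, v int) {
        for i, key := range s.keys {
            if key == k {
                s.vals[i] = v
                return
            }
        }
        s.keys, s.vals = append(s.keys, k), append(s.vals, v)
    }

    func (s *sliceIndex) Lookup(k string) (int, bool) {
        for i, key := range s.keys {
            if key == k {
                return s.vals[i], true
            }
        }
        return 0, false
    }

    // The test asserts only on observable behavior (last write wins), never on
    // node layout, split counts, or other internals.
    func TestLastWriteWins(t *testing.T) {
        for _, idx := range []Index{mapIndex{}, &sliceIndex{}} {
            idx.Set("a", 1)
            idx.Set("a", 2)
            if got, ok := idx.Lookup("a"); !ok || got != 2 {
                t.Fatalf("Lookup(a) = %v, %v; want 2, true", got, ok)
            }
        }
    }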
Dude, if you have the LLM write the tests, then you have no confidence it's testing what you think it is, making the tests worthless.
Dude, I was suggesting that you, not the LLM, write the tests in this scenario.
Really puts the auto- in didact! Very curious to hear how this worked for you; it’s almost directly the opposite of the copilot approach.
I learned assembler by typing in listings from magazines and hand dis-assembling and debugging on paper. Your approach seems similar in spirit, but who has the time these days?
I learned this from Zed Shaw's Learn X The Hard Way books. He says this approach is mainstream in other disciplines, like music, languages, or martial arts.
I also heard the philosopher Ken Wilber spent a few years (in what kids today call Monk Mode) writing out great books by hand.
The main effect I noticed is that I rapidly gain muscle memory in a new programming language, library or codebase.
The other effect is that I'm forced to round-trip every token through my brain, which is very helpful as my eyes tend to glaze over — often I'll be looking right at an obvious bug without seeing it.
Just to add to your point, both Mozart and Chopin were known to hand-copy J.S. Bach's Well-Tempered Clavier preludes and fugues.
I program in neovim with no plugins, no autocomplete and no syntax highlighting. I type everything myself (though I will use copy and paste from time to time). There is a discipline to it that I find very beneficial. As a language designer, it also makes me think very carefully about the syntactic burden of languages that I design. It keeps my languages tight. One of the nice things about typing all of my own code without suggestions is that it eliminates many distractions. I may get some things wrong from time to time, but then I only have myself to blame. And I never waste time messing around with broken plugin configs or irritating syntax highlighting nits.
It's not for everyone but I love it.
Yeah these features are overrated. You can still autocomplete based on existing words in the same file by pressing Ctrl+N, but otherwise just typing it out or copying is totally OK.
I've also experienced autocomplete in NetBeans IDE so slow that it was just faster to type it out.
I quite value syntax highlighting though. Back then I used Turbo Pascal 5.5 on a PC XT because it was way faster and less demanding than Turbo Pascal 6.0, but I remember not having syntax highlighting was a noticeably worse experience. You could get used to doing without it, though.
But it also depends on the language. I've seen some Lua code without syntax highlighting and it was just a soup of words, very unreadable. Whereas something like C with symbols is OK.
I program most languages like this, in emacs.
Wait you took a repo and started typing it into the IDE? Could you please expand on what benefits you noticed and how it affected your understanding of the language? It sounds like a fascinating way to force attention to the code simply reading it wouldn't.
Yeah, I just open two panes in Sublime Text, with the source on the left, and then I type it out verbatim on the right.
I make an effort to keep the line numbers synced. Sometimes I skip long repetitive blocks or comments. But I do type out like 80% of the actual characters in the file.
It's about 500 lines per hour for me, so I can estimate reasonably well how long it'll take.
It's not necessarily an efficient thing to do — you'd get way more bang for your buck just poking around, asking questions, trying to make small changes. But for reasonably small projects, you can type it out in a few hours, or a day or two. Then you've "round-tripped" every single token through your brain (though sadly not with a meaningful amount of conscious reflection) -- unless you pause and ask questions along the way.
See also my other comment above.
Not to offend you, and you've already pointed out the better way to do it, but I don't think there is too much to gain from this approach. When I was learning Vulkan for example, the only thing this helped me learn was which functions they were calling from the API. Their variable names and ifdefs and wrapper functions were completely useless to me. I was able to get their 5000 lines down to just 1000-- and that was for a single untextured cube with direct memory management and simple surface handling. Imagine if it had been more complex? 20,000 lines of typing for little reason. My neck aches thinking about it :)
Instead of a book club, have a code typing club, DuoTypo
It would be funny to type it until it builds, and then type it until the tests pass.
If I were to do that (for sqlite), I'd start with main() and re-type stuff needed to run CREATE TABLE/INSERT/SELECT...
>I was typing out entire codebases for didactic purposes
I've read about an author who did this (I can't remember their name right now), writing down the works of another author they wanted to learn from.
Hunter S. Thompson copied The Great Gatsby on his typewriter: https://news.ycombinator.com/item?id=24696790
can you elaborate on typing out for didactic purposes, please?
It's a new iteration on the ancient practice of learning by copying. I've only ever seen people copy things by hand when they wanted to memorize something, though. I imagine with a keyboard the memory-enhancement effect of writing by hand is lost, but it's probably still more effective than just reading alone.
one does not simply type out 120 000 lines of code..
Number of lines does not matter anymore.
Re: copy-on-write (CoW) B-tree vs append-only log + non-CoW B-tree, why not both?
I.e., just write one file (or several) as a B-tree + a log, appending to the log, and once in a while merging log entries into the B-tree in a CoW manner. Essentially that's what ZFS does, except it's optional when it really shouldn't be. The whole point of the log is to amortize the cost of the copy-on-write B-tree updates, because CoW B-tree updates incur a great deal of write amplification due to having to write all new interior blocks for every leaf node write. If you wait to accumulate a bunch of transactions, then when you finally merge them into the tree you will be able to share many new interior nodes across all those leaf nodes. So just make the log a first-class part of the database.
Also, the log can include small indices of the log entries written since the last B-tree merge, which lets you accumulate even more transactions in the log before having to merge into the B-tree, further amortizing all that write amplification. This approaches an LSM, but with a B-tree at the oldest layer.
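To put rough numbers on the amortization argument, here's a back-of-envelope Go sketch (my own simplified model, not taken from ZFS or any particular engine): it compares pages written when each update is its own CoW transaction versus when a batch accumulated in the log is merged all at once.

    package main

    import "fmt"

    // Assumptions: fixed fanout, every update dirties a distinct leaf, and the
    // number of dirty interior pages per level is bounded by min(batch, nodes).

    // One CoW transaction per update rewrites one leaf plus every interior
    // node on its root-to-leaf path.
    func pagesOneByOne(updates, height int) int {
        return updates * height
    }

    // Merging a whole batch rewrites each dirty leaf once; dirty interior
    // pages are shared across the batch, so higher levels are written far
    // fewer times.
    func pagesBatched(updates, height, fanout, leaves int) int {
        total, nodes := 0, leaves
        for level := 0; level < height; level++ {
            if updates < nodes {
                total += updates
            } else {
                total += nodes
            }
            nodes = (nodes + fanout - 1) / fanout
        }
        return total
    }

    func main() {
        // Hypothetical tree: 1,000,000 leaf pages, fanout 100 => 4 levels.
        for _, batch := range []int{1_000, 10_000, 100_000} {
            fmt.Printf("batch=%6d  one-by-one=%7d pages  batched=%7d pages\n",
                batch, pagesOneByOne(batch, 4), pagesBatched(batch, 4, 100, 1_000_000))
        }
    }

With these made-up parameters, a batch of 100,000 updates writes roughly 110,000 pages instead of 400,000, and the savings grow as the batch gets larger relative to the tree's interior.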
I see what you did there... https://news.ycombinator.com/item?id=42711727
Please tell me they did that on purpose as a response to this post. Top-tier banter!
I recently wrote a KV disk cache in Rust that uses the latest syscalls: https://github.com/anacrolix/possum
I've read a similar series from Phil back in 2020: "Writing a SQL database from scratch in Go" https://notes.eatonphil.com/database-basics.html
The code is available on GitHub: https://github.com/eatonphil/gosql (it's specifically a PostgreSQL implementation in Go).
It's cool to build a database in 3000 lines, but for a real production-ready database you'll need testing. Would love to see some coverage of correctness and reliability tests. For example, SQLite has about 590 times more test code than the library itself. (https://www.sqlite.org/testing.html)
A tiny database under 256 lines of C: https://gitlab.com/danbarry16/u-database
The idea in mind was to use it for something like an RSS feed reader.
Here's a different approach in Common Lisp I wrote a while back:
https://github.com/codr7/whirlog
I started learning Velox last year, and it's a staggering amount of code. Sure, it has a ton of dependencies because it wants to support so many things, but I feel like the core itself is also very complex.
I'm not sold on complexity being a necessity in software engineering, and I'm sure a lot of you aren't either. Yet we see a lot of behemoth projects.
I've recently started to think about covering some advanced topics like compilers, databases, and interpreters. Do you know any good OSS repos where I can learn and build them from scratch on my own?
Man I got the first edition of this book and it was so bad. Hopefully this is better…
Currently it's on sale at Leanpub for $19.80, but apparently the ebook with the Golang code is $39:
https://leanpub.com/build_your_own_database_from_scratch/c/L...
To be honest it seems a bit strange to pay for the code, since all the source code to build the database is literally inside the book.
Can you elaborate on why it was so bad?
Check the reviews online, but generally random bits of Go code are presented with no build-up or explanation, B-trees are implemented immediately and without explanation, and the book is incredibly short and reads like a loose collection of notes rather than an actual book. Very interested to check out the new version.
Waiting for a post where he writes a distributed database.
You're going to feel awful silly when he does it in like, 3 lines.
related: https://dx.tips/oops-database
Thank you