A bit thin on details, and it doesn't look like they'll open-source it, but if someone clicked the post because they're looking for their "replace ES" thing:
Both https://typesense.org/ and https://duckdb.org/ (with its spatial extension) are excellent for geo performance; the latter now seems really production-ready, especially when the data doesn't change that often. Both are fully open source, including clustered/sharded setups.
No affiliation at all, just really happy camper.
These are great projects, we use DuckDB to inspect our data lake and for quick munging.
We will have some more blog posts in the future describing different parts of the system in more detail. We were worried too much density in a single post would make it hard to read.
These are great. I am eternally grateful that projects like this are open source; I do, however, find it hard to integrate them into my own projects.
A while ago I tried to create something that had DuckDB plus its spatial and SQLite extensions statically linked and compiled in. I realized I was a bit in over my head when my build failed because both of them required SQLite symbols, but from different versions.
DuckDB does not have any kind of sharding or clustering? It doesn't even have a server (unless you count the HTTP Server Extension)?
Good point, and that was mostly re: Typesense (can't edit the comment anymore).
But given that DuckDB handles "take this n GB Parquet file/shard from a random location, load it into memory, and be ready in < 1 sec" very well, I'd argue it's quite easy to build something that scales horizontally.
We use it both for the importer pipeline that processes the 2B-row / 200GB compressed GBIF.org Parquet dataset and for queries like https://www.meso.cloud/plants/pinophyta/cupressales/pinopsid... The sheer number of functions[1] beyond simple stuff like "how close is a/b to x/y" or "is n within area x" is just a joy to work with.
[1] https://duckdb.org/docs/stable/core_extensions/spatial/funct...
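As an illustration of that pattern, a minimal Python sketch (the shard URL, column names, and reference point are placeholders, not the actual GBIF schema; ST_Distance here is planar distance on raw lon/lat, fine for a toy ranking):

    import duckdb  # pip install duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL spatial;")
    con.execute("LOAD spatial;")
    con.execute("INSTALL httpfs;")  # read Parquet over HTTP/S3
    con.execute("LOAD httpfs;")

    # Nearest records to a reference point, straight off a remote Parquet shard.
    rows = con.execute("""
        SELECT species,
               ST_Distance(ST_Point(lon, lat), ST_Point(8.55, 47.37)) AS dist
        FROM read_parquet('https://example.com/occurrences/shard-0001.parquet')
        ORDER BY dist
        LIMIT 10
    """).fetchall()
    print(rows)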
Duckdb is so absurdly portable you can solve all kinds of performance and scaling concerns this way. It's really a wonderful, fun piece of software.
MotherDuck is the spin-off from DuckDB that does that. But there's also: https://duckdb.org/2025/05/27/ducklake.html
You can also attach DuckDB to Apache Arrow Flight, which lets it work beyond local operation.
Typesense is an absolute beast, and it has a pretty great dev experience to boot.
Can you share what makes it better than competitors? And what's great about the dev experience? Did you use their cloud offering? The marketing material looks great, but I want to hear a user's experience.
For me it's a combination 1) solid foundational choices all along, no bolted on vanity features or constant rewrites chasing the latest trend, with everything well documented and 2) incredibly responsive founding team, so you get very quick answers from the people actually building it.
Not sure what they'll open-source. The Rust code? They're calling it a DB, but they described an entire stack.
Typesense as a product has been great (hosted cluster). Customer support has been awesome as well.
I wonder if this could help Photon, the open source ElasticSearch/OpenSearch search engine for OSM data.
It's a mini-revolution in the OSM world, where most apps have a bad search experience and typos aren't handled.
https://github.com/komoot/photon
A system built on LMDB will work better for this use case than RocksDB. And OSM Express already uses it. https://wiki.openstreetmap.org/wiki/OSM_Express
Lol I "love" that the first benefit this company lists in their jobs page is "In-Office Culture". Do people actually believe that having to commute is a benefit?
You can't reduce the in-office or remote experience purely to commuting. It's just one aspect about how and where you work and work life balance in general.
But since you asked, yes, I actually enjoy commuting when it is less than 30 minutes each way and especially when it involves physical activities. My best commutes have been walking and biking commutes of around 20-25 minutes each way. They give me exercise, a chance to clear my head, and provide "space" between work and home.
During 2020, I worked from home the entire time and eventually I found it just mentally wasn't good for me to work and live in the same space. I couldn't go into the office, so I started taking hour long walks at the end of every day to reset. It helped a lot.
That said, I've also done commutes of up to an hour each way by crowded train and highway driving, and those are... not good.
> provide "space" between work and home
I don't get this. This idea that 'work life balance' should mean that the two should be compartmentalised to specific blocks of time seems counterproductive to me. To me it feels like an unnatural way of living: 8 hours in which I should only focus on work, 8 hours in which I should focus on everything else, followed by 8 hours of sleep. I don't think that is how we are supposed to operate. Even the 8 hours of sleep in one block is not natural, and is a recent invention. Before industrialisation, people used to sleep in multiple blocks (Wikipedia: polyphasic sleep).
The idea that you have to be 'on' for 8 hours at a time seems extremely stressful to me. No wonder you need an hour afterwards just to unwind. Interleaving blocks of work and personal time over the day feels much more natural and less stressful to me. WFH makes this possible. If I'm stuck on something, I can do something else for a while, maybe even take a short nap. The ability to focus and do mentally straining work comes in waves for me. Being able to go with my natural flow makes me both happier, more relaxed and more productive.
The key to work/life balance to me is not stricter separation but instead better integration.
> This idea that 'work life balance' should mean that the two should be compartmentalised to specific blocks of time seems counterproductive to me.
Different people are different and can have different preferences.
For me, having different physical spaces helps me focus on work at work and my family at home. When they are the same physical space, both suffer. I'm not saying everyone should feel this way.
Counterpoint: when I "wfh" I end up just sleeping 90% of my work hours and smashing out actual work for the remainder. When I'm in an office I'm productive 70% of my hours and it has nothing to do with accountability, just a proper office environment (and yes I have a work area at home). Regardless of going to office or wfh, I don't have set hours.
The overarching point is everyone is different, ymmv.
> provide "space" between work and home.
This is part of the company culture. If the company respects the boundary between work and personal life, and it's a cultural value, then it shouldn't be a problem for you establishing a space even without going to the office. You just close down your work laptop, put it aside and open it up next time when it's time to work again. Of course, there's stuff like on-call shifts, and there's a temptation to just stay later and finish this one thing, but if the company culture does not expect you to be tethered to work 24x7 then it's doable. If the culture is right, you don't need a physical barrier for this to be doable.
> so I started taking hour long walks at the end of every day to reset. It helped a lot.
A good habit. I don't see why any remote worker couldn't do that.
> it shouldn't be a problem for you establishing a space even without going to the office.
No, this had nothing to do with company culture. This was just my own mental response to always being at home. Admittedly, the pandemic accentuated this because we weren't going anywhere even on weekends and evenings. But even as things opened up and we resumed our normal socialization, I returned to the office long before most people because I needed the mental and physical distance.
I know I'm atypical. In those early days, I estimated fewer than 5% of people in my office were voluntarily returning, and even today, when we're at RTO 3 days a week, most people do exactly that and no more.
Or maybe they don't live in the US, where everything is by car :)
I live in Sweden. I assure you, public transport commutes are no joke.
edit: and if you live outside of the city you'll need a car anyway.
Some people like being in office. People are different.
In-office culture would be dope if there were actual benefits to an office like maybe
Learning from smart people, making friends, free food and drinks, a DDR machine
My last office job had none of that. Instead it was just sort of like a depressing scaled up version of my home office
My office has some nice perks!
1. It's extremely cold and dark! I must wear extra clothes when going inside and I get depressed at wasting a day of nice weather in what looks like a WW1 bunker.
2. Terrible accessibility for disabled people! (such as myself)
3. Filthy toilets!
4. Internet is slower than at home!
5. Half the team lives somewhere else so all meetings are on teams anyway!
6. They couldn't afford a decent headset so I get pain in my head after 5 minutes, but I don't have a laptop so I can't move to a meeting room.
The HR really can't understand why after all these great perks I insist on wanting to work from home. I am such an illogical person!
A friend works at an office that allows dogs. Her workplace is one big dog toilet! She is expected to clean it (she is not a toilet cleaner). She got sexually assaulted when her boss shoved his dog into her crotch!
There were some hospitalisations from work-related injuries... regular bullying, threats of violence...
Lovely office culture!
I did quit a job within 2 weeks because of casual racism and sexual harassment in the office. But I was lucky to find something else that fast, which is what made quitting possible.
I'd rather commute than WFH. So yeah, people do. Maybe not all the people, but certainly some people.
Slightly meta, but I find it's a good sign that we're back to designing and blogging about in-house data storage systems / query engines again. There was an explosion of these in the 2010s, which seemed to slow down / refocus on AI recently.
It slowed down not because of AI, but because it turned out it was mostly pointless. Highly specialized stacks that could usually be matched in performance by tweaking an existing system or scaling a different way.
In-house storage/query systems that aren't a product being sold in their own right are NIH syndrome from a company with too many engineering resources.
Is it good? What's left to innovate on in this space? I don't really want experimental data stores. Give me something rock solid.
I don't disagree that rock solid is a good choice, but there is a ton of innovation necessary for data stores.
Especially in the context of embedding search, which this article is also trying to do. We need databases that can efficiently store/query high-dimensional embeddings and handle the nuances of real-world applications, such as filtered ANN. There is a ton of innovation in this space, and it's crucial to powering the next-generation architectures of just about every company out there. At this point, data stores are becoming a bottleneck for serving embedding search, and I cannot overstate how important advancements here are for enabling these solutions. This is why there is an explosion of vector databases right now.
This article is a great example of where the existing data-store providers are not delivering the solutions companies need right now, and there is so much room for improvement in this space.
I do not think data stores are a bottleneck for serving embedding search. I think the raft of new-fangled vector db services (or pgvector or whatever) can be a bottleneck because they are mostly optimized around the long tail of pretty small data. Real internet-scale search systems like ES or Vespa won’t struggle with serving embedding search assuming you have the necessary scale and time/money to invest in them.
Sure they can handle the basic case of ANN. But ANN still doesn’t have good stories for lots of real-world problems.
* filterable ANN, which decomposes into prefiltering or postfiltering (toy sketch below).
* dynamic updates and versioning are still very difficult
* slow building of graph indexes
* adding other signals into the search, such as query time boosting for recent docs.
I don’t disagree these systems can work but innovation is still necessary. We are not in a “data stores are solved” world.
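To make the prefiltering/postfiltering trade-off above concrete, a toy sketch with brute-force numpy standing in for a real ANN index (corpus, filter, and sizes are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(10_000, 64)).astype(np.float32)  # toy corpus embeddings
    meta = rng.integers(0, 2, size=10_000).astype(bool)      # toy filter, e.g. "in stock"
    query = rng.normal(size=64).astype(np.float32)

    def top_k(vectors, q, k):
        scores = vectors @ q                   # dot-product similarity
        idx = np.argpartition(-scores, k)[:k]  # unordered top-k
        return idx[np.argsort(-scores[idx])]   # sorted by score

    # Prefiltering: restrict the corpus first, then search. Exact results,
    # but a real ANN graph index can't cheaply re-restrict itself per query.
    pre_ids = np.flatnonzero(meta)
    pre = pre_ids[top_k(docs[pre_ids], query, 10)]

    # Postfiltering: search everything, then drop non-matching hits. Cheap,
    # but a selective filter can leave you with fewer than k results.
    cand = top_k(docs, query, 100)
    post = cand[meta[cand]][:10]
    print(pre, post)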
> Real internet-scale search systems like ES
Oh, then you must have the secret sauce that allows scaling ES vector search beyond 10,000 results without requiring infinite RAM. I know their forums would welcome it, because that question comes up a lot
Or I guess that's why you included the qualifier about money to invest
Would you mind putting aside the snark? I have a couple questions. How large is the corpus? I am also curious about the use-case for top-k ANN, k > 10000?
Not the person you asked, but at work (we are a CRM platform) we allow our clients to arbitrarily query their userbase to find matching users for marketing campaigns (email, SMS, WhatsApp). These campaigns can sometimes target a few hundred thousand people. We are on a really ancient version of ES, and it sucks at this job in terms of throughput. Some experimenting with BigQuery indicates it is much better at mass exporting.
Fair; my question was mostly in the context of ANN, since that was the discussion point - I have to assume ES (as a search engine) would not necessarily be the right tool for data warehousing types of workloads.
Agreed. The only caveat to that being a global rule is: 'At scale in a particular niche, even an excellent generalist platform might not be good enough'
But then the follow-on question arises: "Am I really suffering the same problems that a niche, already-scaled business is suffering?"
A question that is relevant to all decision making. I'm looking at you, people who use the entire react ecosystem to deploy a blog page.
NoSQL/alternative databases became kind of a meme once people realized that 95% of enterprises can do fine with just Postgres.
This article is lacking detail. For example, how is the data sharded, how much time between indexing and serving, and how does it handle node failure, and other distributed systems questions? How does the latency compare? Etc. etc.
It’s interesting as someone in the search space how many companies are aiming to “replace Elasticsearch”
Author here! We were really motivated to turn a "distributed system" problem into a "monolithic system" from an operations perspective and felt this was achievable with current hardware, which is why we went with in-process, embedded storage systems like RocksDB and Tantivy.
Memory-mapping lets us get pretty far, even with global coverage. We are always able to add more RAM, especially since we're running in the cloud.
Backfills and data updates are also trivial and can be performed in an "immutable" way without having to reason about what's currently in ES/Mongo, we just re-index everything with the same binary in a separate node and ship the final assets to S3.
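A rough sketch of that build-then-ship step (the bucket name, prefix layout, and helper function are hypothetical, for illustration only, not the actual pipeline):

    import os
    import boto3  # assumes AWS credentials are configured in the environment

    def ship_index(local_dir: str, bucket: str, version: str) -> None:
        """Upload a freshly built, immutable index directory to S3 under a
        versioned prefix; serving nodes then download and mmap that version."""
        s3 = boto3.client("s3")
        for root, _, files in os.walk(local_dir):
            for name in files:
                path = os.path.join(root, name)
                key = f"indexes/{version}/{os.path.relpath(path, local_dir)}"
                s3.upload_file(path, bucket, key)

    # Hypothetical usage after a full re-index on a separate builder node:
    # ship_index("/data/index-build", "my-index-bucket", "2024-06-01")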
Why not just use an open-source solution like ParadeDB ...?
ParadeDB = the Postgres pg_search extension (the base is Tantivy). Need anything else, like vectors or whatever? Get the extensions for Postgres.
The only thing you're missing is an LSM solution like RocksDB. See OrioleDB, which is supposed to become a pluggable storage engine for Postgres but is not yet out of beta.
Feels like people reinvent the wheel very often.
What was your experience like putting such thing together?
In my experience, the care and feeding that goes into an Elasticsearch cluster often feels substantially higher than that for the primary data store, which has always struck me as a little odd (particularly in cases where the primary data store is an RDBMS).
I'd be very happy to use simpler, more bulletproof solutions with a subset of ES's features for different use cases.
To add another data point: after working with ES for the past 10 years in production, I have to say that ES has never given us any headaches. We've had issues with ScyllaDB, Redis, etc., but ES is just chugging along and just works.
The one issue I remember: on ES 5 we once had an issue early on where it regularly went down; it turned out that some _very long_ input was being passed into the search by a scraper, which killed the cluster.
I agree, and I don't get where the claims that ES is hard to operate originate from. Yeah, if you allow arbitrary aggregations that exceed the heap space, or if you allow expensive queries that effectively iterate over everything you're gonna have a bad time. But apart from those, as long as you understand your data model, your searches and how data is indexed, ES is absolutely rock-solid, scales and performs like a beast. We run a 35-node cluster with ~ 240TB of disk, 4.5TB of RAM, and about 100TB of documents and are able to serve hundreds of queries. The whole thing does not require any maintenance apart from replacing nodes that failed from unrelated causes (hardware, hosting). Version upgrades are smooth as well.
The only bigger issue we had was when we initially added 10 nodes to double the initial capacity of the cluster. Performance tanked as a result, and it took us about half a day until we finally figured out that the new nodes were using dmraid (Linux RAID0) and as a result the block devices had a really high default read-ahead value (8192) compared to the existing nodes, which resulted in heavy read amplification. The ES manual specifically documents this, but since we hadn't run into this issue ourselves it took us a while to realise what was at fault.
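On the expensive-queries point above: one guardrail is ES's `search.allow_expensive_queries` dynamic cluster setting. A minimal sketch, with the host being a placeholder:

    import requests

    # Disable expensive query types (scripted, heavy wildcard, etc.) cluster-wide.
    resp = requests.put(
        "http://localhost:9200/_cluster/settings",
        json={"persistent": {"search.allow_expensive_queries": False}},
    )
    resp.raise_for_status()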
The thing I like about ES: When the business comes around and adds new requirements out of nowhere, the answer is always: "Yup, we can do it!" Unlike other tools such as Cassandra that force a data design from the get go and make it expensive to change later on.
How many clusters, how many indexes, and how many documents per index? Do you use self-hosted ES or AWS-managed OpenSearch?
12 nodes, 200 million documents / node, very high number of searches and indexing operations. Self-hosted ES on GCP managed Kubernetes.
Lots of other options here if you don't like managing it yourself. You can use Elastic Cloud, Bonsai.io, and others.
A lot of places can't put their data just anywhere.
And they can pay the vendors for "bring your own cloud" or similar. If data sovereignty is important to them, then they can probably afford it. And if cost is an issue, then they wouldn't be looking at hosted solutions in the first place.
They manage it in your GCP project, so you can also make use of your commitments etc.
How big is the team that looks after it?
Nobody is actively looking after it. We have good alerting + monitoring, and if there's an alert, like a node going down because of some Kubernetes node shuffling, or a version upgrade that has to be performed, one of our few infra people will handle it.
It's really not something that needs much attention in my experience.
Check out Manticore Search - it's older than Lucene (which Elasticsearch is built on), faster, and simpler.
GPLv3 https://github.com/manticoresoftware/manticoresearch/blob/13...
Are you running customized, or even embedded, versions of your search engine? Highly doubtful. So, GPL is irrelevant. Just run it and interface with it via API.
I'm interested in this detail because a few years back I was involved in a major big data project at a health insurance company and I cooked up a solution that involved ElasticSearch that was workable only to be shot down--it was political, but they had to do it with Kafka, full stop. The problem was, at that time, Kafka wasn't very mature and it wasn't a good solution for the problem, regardless. So our ES version got shelved.
In my experience, Elasticsearch lacks fundamental tooling, like a CLI that copies data between nodes.
The `/_cluster/reroute` endpoint lets you do that with a curl. We have aliases for common operations so I've never felt that I lack a CLI. I'm happy with Elasticsearch overall having a few years of experience.
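For example, moving a shard between nodes looks roughly like this (index and node names are placeholders):

    import requests

    # Ask the cluster to move shard 0 of "my-index" from node-1 to node-2.
    resp = requests.post(
        "http://localhost:9200/_cluster/reroute",
        json={"commands": [{"move": {
            "index": "my-index", "shard": 0,
            "from_node": "node-1", "to_node": "node-2",
        }}]},
    )
    resp.raise_for_status()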
It is weird to include "Rust" (a language) in the title. Readers might wonder what is replaced by Rust? Elasticsearch or MongoDB?
Nice... it's cool to see how different companies are putting together best fit solutions. I'm also glad that they at least started out with off the shelf apps instead of jumping to something like a bespoke solution early on.
Quickwit[1] looks interesting, found via Tantivity reference. Kind of like ES w/ Lucene.
1. https://github.com/quickwit-oss/quickwit
it's tantivy :)
Rocks is a fork of Level, and Level is well known for data corruption and other bugs. They are both "run at production scale", but at least back when I worked on stuff that used Level, nobody talked publicly about all the toil spent on cleaning up and repairing Level to keep the services based on it running.
Whenever you see an advertisement like this (these posts are ads for the companies publishing them), they will not be telling you the full truth of their new stack, like the downsides or how serious they can be (if they've even discovered them yet). It's the same for tech talks by people from "big name companies". They are selling you a narrative.
RocksDB diverged from LevelDB a long time ago at this point and has had extensive work done on it by both industry and academia. It's not a toy database like LevelDB was. I can't speak to the problems they're supposedly hiding in their stack, but they are unlikely to come from RocksDB.
This is not my experience. I've been running RocksDB for 4 years on thousands of machines, each storing terabytes of data, and I haven't seen a single correctness issue caused by RocksDB.
They're not open sourcing it though?
It's a bit difficult at the moment, given we have a lot of proprietary data and a lot of the logic follows it. I'm hoping we can get it to a state where it can index and serve OSM data, but that is going to take some time.
That being said, we are currently working on getting our Google S2 Rust bindings open-sourced. This is a geo-hashing library that makes it very easy to write a reverse geocoder, even from a point-in-polygon or polygon-intersection perspective.
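Until those bindings land, the idea can be sketched with s2sphere, the pure-Python S2 port (the cell level and the toy index are assumptions; a real reverse geocoder would cover each polygon with S2 cells and then do an exact point-in-polygon check on the candidates):

    import s2sphere  # pip install s2sphere

    LEVEL = 12  # cell size is a guess; finer levels mean smaller cells

    def cell_id(lat: float, lng: float, level: int = LEVEL) -> int:
        """Map a lat/lng to the 64-bit id of its containing S2 cell."""
        ll = s2sphere.LatLng.from_degrees(lat, lng)
        return s2sphere.CellId.from_lat_lng(ll).parent(level).id()

    # Toy "index" from coarse cells to place names.
    index = {cell_id(40.7484, -73.9857): "Midtown Manhattan"}
    print(index.get(cell_id(40.7484, -73.9857), "unknown"))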
Could you write a photon replacement if you had that? I would love to spend less per month running photon for my project.
Doesn't sound like it, but it's a nice writeup of the tools they stitched together. For someone to copy and open source... hopefully :)
There are a few pieces of this that rely on proprietary data, especially the FastText training step, so that's a dead end unfortunately (would love to be proven wrong). I'd consider subbing in a small BERT model with a classifier head for something FOSS without access to tons of user data, but then you lose the ability to serve high QPS.
I guess not having that would only break forward geocoding from an address?
My guess is that they're using FastText for semantic search, so it's more likely to break queries like "coffee near me" than address search, the latter likely being handled by tantivy. For context, I've also written a geocoder [0] based on tantivy. :)
[0] https://github.com/ellenhp/airmail
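If that guess is right, the FastText piece might look something like this supervised-classifier sketch (the labels and training file are invented for illustration):

    import fasttext  # pip install fasttext

    # queries.train: one labeled query per line, e.g.
    #   __label__category coffee near me
    #   __label__address 841 broadway new york
    model = fasttext.train_supervised("queries.train", epoch=25, wordNgrams=2)

    labels, probs = model.predict("coffee near me")
    print(labels[0], probs[0])  # e.g. ('__label__category', 0.97)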
Tempted, especially for switching to H3 instead of S2… I prototyped a similar solution a couple of weeks ago, so I could probably do a second pass.
What's wrong with S2? H3 is so much more complex for very little gain from what I can tell.
Clicked because of Elasticsearch, then wondered why I hadn't known of radar.com before. Just the autocomplete at a reasonable price that I need.
Side note 1: ES can also be embedded in your app (on the JVM).
Note 2: I actually used RocksDB to solve many use cases and it's quite powerful and very performant. If you take anything from this post, take this: it's open source and a very solid building block.
Note 3: I would like to test-drive Quickwit as an ES replacement. Haven't got the time yet.
1 - If we were sticking with the JVM, I do wonder whether Lucene itself would be the right choice in that case
2 - It's a great tool with a lot of tuneability and support!
3 - We've been using it for K8s logs and OTEL (with Jaeger). Seems good so far, though I do wonder how the future of this will play out with the $DDOG acquisition.
I really enjoy embedding things in the vm. I run a discord bot with a few thousand users with embedded H2. Recently I’ve been looking at trying to embed keycloak (or something similar) for some other apps.
I did that with ES to squeeze out performance, but IIRC it didn't really produce meaningful results. Otherwise, for most use cases an integration is IMHO better than embedding, at least when you have a full software service such as Keycloak or ES. RocksDB and H2 are tailor-made as embedded libraries.
Would love to know how they scaled it. Also, what happens when you lose the machine and the local DB? I imagine there are backups, but they should have mentioned it. Even with backups, how do you ensure zero data loss?
Is HorizonDB publicly available for us to try as well?
Searching for HorizonDB, I find a Python project on GitHub.
I'm guessing it's closed source, *aaS only?
I've used RocksDB a lot in the past and am very satisfied with it. It was helpful building a large write-heavy index where most of the data had to be compressed on disk.
I'm wondering if anyone here has experience with LMDB and can comment on how they compare?
https://www.symas.com/mdb
I'm looking at it next for a project which has to cache and serve relatively small static data, and write and look up millions of individual points per minute.
LMDB is for read-heavy workloads. The opposite of RocksDB.
RocksDB can use thousands of file descriptors at once, on larger DBs. Makes it unsuitable for servers that may also need to manage thousands of client connections at once.
LMDB uses 2 file descriptors at most; just 1 if you don't use its lock management, or if you're serving static data from a readonly filesystem.
RocksDB requires extensive configuration to tune properly. LMDB doesn't require any tuning.
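For the cache-and-lookup workload described above, a minimal py-lmdb sketch (the path, map_size, and keys are arbitrary):

    import lmdb  # pip install lmdb

    # map_size caps the DB size; LMDB mmaps the file, so reads are zero-copy
    # and there is essentially nothing else to tune.
    env = lmdb.open("/tmp/points.lmdb", map_size=2**30)

    with env.begin(write=True) as txn:  # one writer at a time
        txn.put(b"point:42", b"59.33,18.07")

    with env.begin() as txn:            # many concurrent readers
        print(txn.get(b"point:42"))     # b'59.33,18.07'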
I mean, anything could replace Elasticsearch, but can it actually?
It sounds like they had the wrong architecture to start with, and they built a database to handle it. Kudos. Most would have just thrown a cache at it or fine-tuned a read-only PostGIS database for the geoip lookups.
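For reference, the tuned read-only PostGIS route could look roughly like this KNN query (table and column names are invented; assumes a PostGIS-enabled Postgres with a GiST index on the geometry column):

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=geo")
    cur = conn.cursor()
    # `<->` is PostGIS's index-assisted distance operator, so this returns
    # the 10 nearest rows without scanning the whole table.
    cur.execute(
        """
        SELECT name
        FROM places
        ORDER BY geom <-> ST_SetSRID(ST_MakePoint(%s, %s), 4326)
        LIMIT 10
        """,
        (-73.9857, 40.7484),
    )
    print(cur.fetchall())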
Without benchmarks, it's just bold claims we'd have to verify ourselves.
These are not the same kinds of things.
I can see ditching Mongo, but what's bad about ElasticSearch? Too expensive in some way?
Isn't RocksDB just the db engine for Kafka?
Sounds like all they need is Postgres or just SQLite.
Yep, their goal was a monolithic solution instead of a clustered Elasticsearch.
Postgres + pg_search (= Tantivy) would have gotten them 80% of the way there. Sure, Postgres really needs a pluggable storage engine for better SSD support (see OrioleDB).
But creating your own database for your own company is just silly.
There is a lot of money in the database market, and everybody wants to do their own thing to tie customers down to those databases. And that is their main goal.
fastText? Last time I checked, it wasn't even maintained.