Great point!
(Disclaimer: I work for Elastic)
Elasticsearch has recently added a data type called semantic_text, which automatically chunks text, calculates embeddings, and stores the chunks with sensible defaults.
Queries are similarly simplified: vectors are calculated and compared internally, which means a lot less I/O and much simpler client code.
https://www.elastic.co/search-labs/blog/semantic-search-simp...
I made something similar, but used DuckDB as the vector store (and query engine)! It's impressively fast.
https://github.com/patricktrainer/duckdb-embedding-search
Is there a vector data type available in DuckDB now?
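There is: recent DuckDB releases include fixed-size ARRAY types plus similarity functions. A rough sketch, with syntax per the DuckDB docs (worth double-checking for your version):
```
-- Fixed-size ARRAY column plus a cosine-similarity search in plain DuckDB SQL.
CREATE TABLE docs (id INTEGER, embedding FLOAT[3]);
INSERT INTO docs VALUES
    (1, array_value(0.1::FLOAT, 0.2::FLOAT, 0.3::FLOAT)),
    (2, array_value(0.9::FLOAT, 0.8::FLOAT, 0.7::FLOAT));

SELECT id,
       array_cosine_similarity(embedding,
           array_value(0.2::FLOAT, 0.2::FLOAT, 0.2::FLOAT)) AS sim
FROM docs
ORDER BY sim DESC
LIMIT 1;
```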
How does their embedding model compare in terms of retrieval accuracy to, say, `text-embedding-3-small` and `text-embedding-3-large`?
You can use OpenAI embeddings in Elastic if you don't want to use their ELSER sparse embeddings.
It's impossible to answer that question without knowing what content/query domain you are embedding. Check out the MTEB leaderboard, dig into the retrieval benchmark, and look for analogous datasets.
So we're talking about picking the best embedding model per use case? Medical data would require a different model than, say, sales data? Sounds like a very fragmented approach.
Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index, but it functions like one: it keeps embeddings updated as source data gets added, deleted, or changed.
Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.
Eager to hear your feedback and reactions. If you'd like to leave an issue or, better yet, a PR, you can do so here [1].
[1]: https://github.com/timescale/pgai
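For a flavor of the API: creating a vectorizer looks roughly like the sketch below. It follows the pgai docs at the time of writing; the `blog` table is hypothetical and option names may evolve.
```
-- Sketch: declare a vectorizer over a hypothetical `blog` table.
-- Chunking, embedding model, and destination are all declared up front;
-- the system keeps the embeddings in sync from then on.
SELECT ai.create_vectorizer(
    'blog'::regclass,
    destination => 'blog_contents_embeddings',
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    chunking    => ai.chunking_recursive_character_text_splitter('contents')
);
```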
Pretty smart. Why is the DB API the abstraction layer though? Why not two columns and a microservice? I assume you are making async calls to get the embeddings?
I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geo for an address, is this email considered a spammer, etc. feel like app-level stuff.
(post co-author here)
The DB is the right layer from an interface point of view -- because that's where the data properties should be defined. We also use the DB for bookkeeping of what needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database, in a Python worker or cloud functions.
Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.
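To make the "single view" point concrete, here's a hand-rolled sketch with hypothetical table names (not the tool's actual internals):
```
-- Chunk embeddings live in a side table; a view joins them back to the
-- source rows, so ordinary SQL (joins, filters, aggregates) works over both.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE blog (
    id       bigint PRIMARY KEY,
    title    text,
    contents text
);

CREATE TABLE blog_embeddings_store (
    id        bigint REFERENCES blog (id) ON DELETE CASCADE,
    chunk     text,
    embedding vector(768)
);

CREATE VIEW blog_embeddings AS
SELECT b.id, b.title, b.contents, e.chunk, e.embedding
FROM blog b
JOIN blog_embeddings_store e USING (id);
```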
Hey, this looks great! I'm a huge fan of vectors in Postgres or wherever your data lives, and this seems like a great abstraction.
When I write a SQL query that includes a vector search and some piece of logic, like:
```
select name from users where age > 21 order by <vector_similarity(users.bio, "I like long walks on the beach")> limit 10;
```
Does it filter by age first or second? I've liked the DX of pgvector, but they do vector search, followed by filtering. It seems like that slows down what should be the superpower of a setup like this.
Here's a bit more of a complicated example of what I'm talking about: https://blog.bawolf.com/p/embeddings-are-a-good-starting-poi...
(post co-author here)
It could do either, depending on what the planner decides. In pgvector it usually does post-filtering in practice (filter after vector search).
pgvector HNSW has the problem that there is a cutoff of retrieving some constant C results, and if none of them match the filter then it won't find results. I believe newer versions of pgvector address that. Also, pgvectorscale's StreamingDiskANN[1] doesn't have that problem to begin with.
[1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...
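One way to check which order you're getting is to ask the planner directly. A minimal sketch, assuming plain pgvector with an HNSW index (schema is hypothetical):
```
-- Inspect whether the filter runs before or after the vector index scan.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE users (name text, age int, bio_embedding vector(3));
CREATE INDEX ON users USING hnsw (bio_embedding vector_cosine_ops);

SET hnsw.ef_search = 100;  -- raises the fixed candidate cutoff mentioned above

EXPLAIN ANALYZE
SELECT name
FROM users
WHERE age > 21
ORDER BY bio_embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
-- An HNSW index scan with the age condition applied as a Filter on top
-- means post-filtering; a Seq Scan plus Sort means the index wasn't used.
```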
pgvector does post-filtering, not pre-filtering.
TimescaleDB's pgvectorscale extension does pre-filtering, thankfully. Shame I can't get it on RDS though.
I agree.
Similar to the blog post, but at the ORM layer instead of the extension layer: I built a PostgreSQL ORM for Node.js, based on ActiveRecord + Django's ORM, that includes the concept of vector fields [0][1] and lets you write code like this:
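```
// Stores the `title` and `content` fields together as a vector
// in the `content_embedding` vector field
BlogPost.vectorizes(
  'content_embedding',
  (title, content) => `Title: ${title}\n\nBody: ${content}`
);

// Find the top 10 blog posts matching "blog posts about dogs"
// Automatically converts the query to a vector
let searchBlogPosts = await BlogPost.query()
  .search('content_embedding', 'blog posts about dogs')
  .limit(10)
  .select();
```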
I find it tremendously useful; you can query the underlying data or the embedding content, and you can define how the fields in the model get stored as embeddings in the first place.
[0] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
[1] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
What's wrong with using FAISS as your single db?
It's like SQLite for vector embeddings, and you can store metadata (the primary data, foreign keys, etc.) along with the vectors, preserving the relationship.
Not sure if the metadata is indexed, but IIRC it's more or less trivial to update the embeddings when your data changes (though I haven't used it in a while, so not sure).
Good question. For most standalone vector search use cases, FAISS or a library like it is a good fit.
However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have things you'd want in your app DB, like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.
For new RAG and search apps, many teams prefer using a single app DB with vector search capabilities included (Postgres, Mongo, MySQL, etc.) vs. managing an app DB and a separate vector DB.
Wow, actually a good point I haven't seen anyone make.
Taking raw embeddings and storing them in vector databases would be like taking raw n-grams of your text and putting them into a database for search.
Storing documents makes much more sense.
Been using pgvector for a while, and to me it was kind of obvious that the source document and the embeddings are fundamentally linked, so we always stored them "together". Basically anyone doing embeddings at scale is doing something similar to what Pgai Vectorizer is doing, and it's certainly a nice abstraction.
I used FAISS, as it also allowed me to trivially store them together.
I don't know how well it scales though; it's just doing its job at my hobby-project scale.
For my few hundred thousand embeddings, I must say the performance was satisfactory.
Safe to say that if you're using off-the-shelf character-based chunking, your AI app is not past PoC.
I’m using sqlite-vec along with FTS5 in (you guessed it) SQLite and it’s pretty cool. :)
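A rough sketch of that combination, with virtual-table syntax as given in the sqlite-vec and FTS5 docs (worth double-checking against your versions):
```
-- A lexical index via FTS5 and a vector index via sqlite-vec, side by side.
CREATE VIRTUAL TABLE doc_fts USING fts5(body);
CREATE VIRTUAL TABLE doc_vec USING vec0(embedding float[384]);

-- Keyword candidates:
SELECT rowid FROM doc_fts WHERE doc_fts MATCH 'dogs' LIMIT 20;

-- Nearest neighbours (query embedding bound as a parameter):
SELECT rowid, distance
FROM doc_vec
WHERE embedding MATCH :query_embedding
ORDER BY distance
LIMIT 10;
```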
Yes. Materialized Views are good.
That was just what I was thinking. This approach will have the same issues that materialized views have.
haha. We had a good internal debate about whether this is more like indexes or more like Materialized Views. It's kind of a mixture of the two.
> Vector databases treat embeddings as independent data, divorced from the source data from which embeddings are created
With the exception of Pinecone: Chroma, Qdrant, Weaviate, Elastic, Mongo, and many others store the chunk/document alongside the embedding.
This is intentional misinformation.
Post co-author here. The point is a little nuanced, so let me explain:
You are correct in saying that you can store embeddings and source data together in many vector DBs. We actually point this out in the post. The main point is that they are not linked but merely stored alongside each other. If one changes, the other does not automatically change, leaving the embeddings stale relative to the source data.
The idea behind Pgai Vectorizer is that it actually links embeddings with the underlying source data, so that changes in source data are automatically reflected in embeddings. This is a better abstraction, and it removes the burden on the engineer of ensuring embeddings stay in sync as their data changes.
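A minimal sketch of that linkage (hypothetical names, not Vectorizer's actual internals): a trigger records changed rows in a work queue that an external worker drains to re-embed them. It assumes a source table like the `blog` sketch above.
```
-- Track rows whose embeddings need refreshing, transactionally with the write.
CREATE TABLE embedding_work_queue (
    id        bigint,
    queued_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION enqueue_for_embedding() RETURNS trigger AS $$
BEGIN
    INSERT INTO embedding_work_queue (id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_embedding_sync
AFTER INSERT OR UPDATE OF contents ON blog
FOR EACH ROW EXECUTE FUNCTION enqueue_for_embedding();
```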
I know that in Chroma this is supported out of the box with 0 lines of code. I'm pretty sure it's supported everywhere else in no more than 3 lines of code.
As far as I can tell, Chroma can only store chunks, not the original documents. This is from your docs: `If the documents are too large to embed using the chosen embedding function, an exception will be raised`.
In addition, it seems that embeddings happen at ingest time. So if, for example, the OpenAI endpoint is down, the insert will fail. That, in turn, means your users need a retry mechanism and a queuing system -- all the complexity we describe in our blog.
Obviously, I am not an expert in Chroma, so apologies in advance if I got anything wrong. Just trying to get to the heart of the differences between the two systems.
> the responsibility for generating and updating them as the underlying data changes can be handed over to the database management system
And now we shift ever so slightly back towards logic in the DB. I, for one, am thrilled; there's no reason other than unfamiliarity not to let the RDBMS perform functions it's designed to do. As long as these offloads are documented in code, embrace not needing to handle it in your app.
This reads solely as a sales pitch, which quickly cuts to the "we're selling this product so you don't have to think about it."
...when you actually do want to think about it (in 2024).
Right now, we're collectively still figuring out:
1. Best chunking strategies for documents
2. Best ways to add context around chunks of documents
3. How to mix and match similarity search with hybrid search
4. Best way to version and update your embeddings
(post co-author here)
We agree a lot of stuff still needs to be figured out, which is why we made the vectorizer very configurable. You can configure chunking strategies and formatting (which is a way to add context back into chunks), and you can mix semantic and lexical search on the results. That handles your 1, 2, and 3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support[1].
Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.
Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.
[1]: https://www.timescale.com/blog/which-rag-chunking-and-format...
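As a concrete sketch of that configurability: chunking and formatting are options on the vectorizer itself. Option names follow the pgai docs and may have evolved; the `blog` table is hypothetical.
```
-- Sketch: set chunk size/overlap and re-add title context to every chunk.
SELECT ai.create_vectorizer(
    'blog'::regclass,
    embedding  => ai.embedding_openai('text-embedding-3-small', 768),
    chunking   => ai.chunking_character_text_splitter('contents', 800, 400),
    -- "formatting" prepends source-row context (here, the title) per chunk:
    formatting => ai.formatting_python_template('Title: $title $chunk')
);
```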