Show HN: I put PubMed in a vector DB

pubmedisearch.com

96 points by mpmisko 11 days ago

Hi HN,

As a researcher, I often found myself struggling with the limitations of keyword-based search when exploring PubMed papers. To address this, I created PubMed Search (https://www.pubmedisearch.com/), a tool that leverages a vector database to enable semantic search across medical research literature.

Some key features:

* Daily updates to ensure access to the latest articles

* Semantic search using the latest & greatest embedding models

* Some additional useful info about the papers (tldr, journal, publication date, etc.)

Hope you find it useful!

bdangubic 10 days ago

Hey mate, should search by PMID work? Like 35982160 is PMID for "Rare coding variation provides insight into the genetic architecture and phenotypic context of autism" - not seeing this publication at all in search results...

dpifke 10 days ago

Very cool!

Related: the NIST TREC (Text REtrieval Conference) has had several competitions over the years related to improving the searchability of medical data: https://www.trec-cds.org/

If you have novel ideas in this area, you should consider participating. https://trec.nist.gov/

  • mpmisko 9 days ago

    Thanks! Looks quite relevant

lucas_crocker 10 days ago

This is very cool! 2 questions spring to mind:

1. How much did it cost to embed all those vectors and how many articles did you process? PMC is quite large.

2. Could you elaborate a little more on your approach to ranking articles? I'm familiar with semantic search via embeddings, but did you weight those with impact factors/citations? Like how does one even calculate that?

Anyhow, love the idea.

  • mpmisko 10 days ago

    1. We cover all the articles on PMC. The exact cost is hard to estimate because we did a lot of iterations.

    2. We do weight those ... it is a lot of trial and error and you have to have good & exhaustive benchmarks.
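
One way such a weighting could look is sketched below. This is only an illustration under stated assumptions, not the pipeline described in this thread: the `Hit` fields, the log damping, and the `alpha` blend weight are all hypothetical, and in practice they would be tuned against benchmarks as the reply above says.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical search hit; the field names are illustrative only."""
    title: str
    similarity: float     # similarity score from the vector search, assumed in [0, 1]
    citations: int        # citation count of the paper
    impact_factor: float  # impact factor of the journal


def quality(hit: Hit) -> float:
    # log1p dampens heavy-tailed values so one blockbuster paper
    # does not dominate the quality signal.
    return math.log1p(hit.citations) + math.log1p(hit.impact_factor)


def rerank(hits: List[Hit], alpha: float = 0.8) -> List[Hit]:
    """Blend semantic similarity with a normalized quality prior.

    alpha is an assumed tuning knob: 1.0 ranks by similarity alone,
    lower values give more weight to citations and impact factor.
    """
    max_q = max((quality(h) for h in hits), default=0.0) or 1.0

    def score(h: Hit) -> float:
        return alpha * h.similarity + (1 - alpha) * quality(h) / max_q

    return sorted(hits, key=score, reverse=True)
```

With `alpha=1.0` the order is purely semantic; lowering it lets a well-cited paper overtake a slightly closer but uncited one.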

rkwz 10 days ago

Congrats on shipping!

I'm curious how the search results rankings work, doesn't look like it's based on date or number of citations, but seems to be deterministic (persists over multiple searches). I did a keyword search using one word.

  • mpmisko 10 days ago

    Thanks!

    It uses a vector search approach. Your query is embedded in a vector space using a language model and we find the closest vector to the query from the PubMed papers. This is a good summary of the techniques: https://learn.microsoft.com/en-us/azure/search/vector-search.... There are a couple more tricks but this is the gist.

    The nice part is that this approach allows you to find papers relevant to your question. E.g., you can ask "Can secondhand smoke cause AMD?" and the very first few papers answer your question (https://pubmedisearch.com/share/Can%20secondhand%20smoke%20c...). The more specific the question, the better. :)
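
The embed-then-find-nearest step described above can be sketched roughly as follows. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in swapped in for a real language-model embedding (so it only captures word overlap, not meaning), the corpus titles are made up, and none of the "couple more tricks" are shown.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, papers: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Embed the query, score every paper, and return the k closest."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up titles, for illustration only.
PAPERS = [
    "Secondhand smoke exposure and risk of age-related macular degeneration",
    "Statin therapy and cardiovascular outcomes in older adults",
    "Sleep duration and cognitive decline in a community cohort",
]

if __name__ == "__main__":
    for score, title in search("Can secondhand smoke cause macular degeneration?", PAPERS):
        print(f"{score:.3f}  {title}")
```

A real system would replace the brute-force loop with an approximate nearest-neighbor index in a vector database, since scoring every paper per query does not scale to a corpus the size of PubMed.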

    • victor106 10 days ago

      Cool idea!

      What are some papers labeled "High Quality Article"? How do you determine that?

      • mpmisko 10 days ago

        Just looking at stuff like citations and impact factors of journals.

kkielhofner 10 days ago

Nice!

Out of curiosity what model(s) are you using to generate the embeddings?

  • mpmisko 10 days ago

    Glad you like it! I did this as a mini-project within our startup MediSearch (https://medisearch.io/) & the search pipeline is custom tuned for the problem.

grumpopotamus 10 days ago

What are you embedding exactly? Chunks of documents?

madhatter999 8 days ago

Very promising tool based on a couple of questions I asked it! What did the cleaning of documents look like?

  • mpmisko 8 days ago

    Lots of annoying edge cases as you can imagine, nothing particularly glamorous.

mharig 8 days ago

All 10 thumbs up!

Edit: One suggestion: in the results list, please make the headings links to the articles, too.

  • mpmisko 8 days ago

    Done! Let me know if you have other feedback.

drycabinet 9 days ago

Maybe a stupid question, but how do you compare this against GPT-based search engines?

  • mpmisko 7 days ago

    GPT-based search engines usually use some sort of a database to retrieve context for the LLM to summarize first. This is what people refer to as RAG these days: https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge....

    Some of these GPT engines maintain their own vector DB to do semantic search, others are directly hooked into Bing / Google. So pubmedisearch.com would be one component of a GPT-based engine. We actually have a GPT-based engine here: https://medisearch.io/.
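
The retrieve-then-summarize flow described above can be sketched as below. This is a minimal sketch under assumptions: `retrieve` is a hypothetical word-overlap retriever standing in for a vector DB or web search, the prompt wording is invented, and `llm` is any caller-supplied function from prompt to text rather than a specific vendor API.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Hypothetical retriever: rank passages by words shared with the query.

    In a real engine this is where a semantic search component
    (vector DB, Bing, Google) would plug in.
    """
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Pack the retrieved passages into the context the LLM will summarize."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer(query: str, corpus: List[str], llm: Callable[[str], str], k: int = 2) -> str:
    """RAG in one line: retrieve context first, then let the LLM write the answer."""
    return llm(build_prompt(query, retrieve(query, corpus, k)))
```

Passing the model in as a plain callable keeps the retrieval component swappable, which mirrors the point above that a search tool is one part of a larger GPT-based engine.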

alex_duf 10 days ago

What storage did you go for, and what search approach?

  • mpmisko 10 days ago

    We use Pinecone and it is not ideal; we're looking at https://turbopuffer.com/ now. They look quite promising :)

    • yumraj 9 days ago

      Did you compare pinecone against pgvector with Postgres? Self hosted of course

      • cchance 9 days ago

        Isn't it funny how the best choice somehow always comes back to Postgres in the end XD (for most)

        • yumraj 9 days ago

          Yes, that’s where I am these days. I don’t even think of venturing outside of Postgres, except for things like Redis, where there are mature and established options for specific use cases.

          • mpmisko 8 days ago

            Will definitely check pgvector, thanks for the pointer.

    • cchance 9 days ago

      What kinda dimensions? Did you keep it relatively low to keep costs down?