iDon 2 years ago

Extending that idea to the web, or at least to the blogosphere and information/knowledge websites, seems useful. I wonder if there is a web service which has calculated vector embeddings for some of the web and supports vector search, e.g. given a URL, find URLs with similar embeddings. Inverting that, websites could annotate their web pages with embeddings via JSON-LD, which search engines could utilise. Both these ideas might be impractical: e.g. the cost of an HTTP GET of the vector might be similar to the cost of calculating the embedding, and the embedding would only be comparable with embeddings from the same model (which would be recorded in the JSON-LD), so it would age quickly. It would also be subject to SEO gaming, like meta tags.

A quick search didn't find either of these; the closest was this paper, which used JSON-LD to record a vector reduced to 2 dimensions using t-SNE: https://hajirajabeen.github.io/publications/Metadata_for_Eme... "Metadata standards for the FAIR sharing of vector embeddings in Biomedicine", Şenay Kafkas et al.

  • tomhazledine 2 years ago

    Yeah - it's a great idea. The size of the embeddings is the big restricting factor IMO. Even with my approach of embedding the entire article, my embeddings index was about the same size as my "regular" search index.

    Once you start increasing the granularity of what you're embedding (either by paragraph or sentence) then the old-fashioned search index has a big advantage.

    Might be worth it in some scenarios because of the quality of the results. I bet there are places where an embedding search would be more effective by orders of magnitude.

  • flir 2 years ago

    > It would also be subject to SEO gaming, like meta tags.

    Would be easy to mark a vector-providing site as a bad actor, though? Re-run a few of their pages, if you come up with different answers, don't trust them.

TOMDM 2 years ago

Really neat, thank you for posting!

How did you end up handling the case where posts are longer than the context window for the embedding API?

  • seanhunter 2 years ago

    This comment is similar to the comment I wanted to make because I also thought it was pretty nifty.

    Joking aside this is pretty cool. One thing about the whole embeddings/cosine similarity thing for people who are struggling with understanding it.

    Computers are good at doing lots of sums. Embeddings turn a problem that seems to be about something else[1] into a problem involving lots of sums by turning that something else into numbers.

    So when we turn some text into embeddings (numbers), what do those numbers mean? You could imagine a space with a lot of dimensions - the author is using openai embeddings so it's like a thousand dimensions or something - and every point in that space is some embedding, which is actually a numerical representation of the meaning of some text.[2] So things with similar meaning have embeddings which are close to one another in this space. How do you decide what "close" is?

    Well one easy way is cosine similarity. Since these are vectors, imagine two arrows coming from the origin. To make things simpler, imagine it in the two-dimensional plane rather than 1000 dimensions, which would make your brain leak out of your ears. So you have two arrows going from the origin to some two points. What you want is the length of the line from the tip of one arrow to the tip of the other arrow. For people who struggle to remember their trig, this distance is given by c^2 = a^2 + b^2 - 2ab cos theta. It just so happens that if you take the dot product of two vectors and divide by the product of their norms you get the cosine of the angle between them (cos theta). That's why it's called cosine similarity even though you don't see a cosine in the formula.[3]

    [1] language usually in the case of LLMs, but embeddings aren't only about text.

    [2] this is why searching embeddings is called semantic search.

    [3] The term cosine distance is often used loosely for this although I believe it's actually technically not a distance because it doesn't obey certain properties that are necessary for something to be a distance.
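
    The trig above boils down to just a few lines of code. A minimal sketch in Python (toy 2-D vectors for illustration; real OpenAI embeddings have on the order of 1500 dimensions, but the formula is identical):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|) == cos(theta), the angle between the vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy 2-D "embeddings": similar directions give a value near 1
print(cosine_similarity([1.0, 0.0], [1.0, 0.1]))  # close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```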

    • tomhazledine 2 years ago

      Even that simple explanation makes my brain itch a little :D I never did master trig

      I'm curious if there ARE alternative methods to cosine similarity. A lot of the things I've read mention that cosine similarity is "one of the ways to compute distance..." or "a simple way...". But I've not seen any real suggestions for alternatives. Guess everyone's thinking "if it ain't broke, don't fix it" as cosine similarity works pretty darn well

      • seanhunter 2 years ago

        Yeah there are a few other ways. The most common is the “L2 norm” (Euclidean distance), which would be the hypotenuse of a right triangle: if your points are (x1,y1), (x2,y2) then it is sqrt((x1-x2)^2 + (y1-y2)^2), which you might recognise from Pythagoras’ theorem (c^2 = a^2 + b^2). If you have 1000 dimensions, then instead of doing that just twice, for x and y, you are doing it a thousand times, but the principle is the same.

        Another one is “Manhattan distance” (known as the L1 norm or sometimes as “taxicab distance”), which is just abs(x1-x2)+abs(y1-y2) in that example. If you imagine a set of city blocks and you want to go from one place to another the cab has to go north/south and east/west and can’t go diagonally. That’s the distance it travels. You’re adding up all the North/south parts and the east/west parts.

        There are a bunch of other distance measures too, e.g. on one project I worked on we used Mahalanobis distance, a more complex measure which adjusts for covariance between the dimensions of your space. That wouldn’t be useful for this particular problem, though.
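
        In code, both of the common ones reduce to one-liners over the coordinate differences. A quick Python sketch using the classic 3-4-5 right triangle:

```python
import math

def l2(a, b):
    # Euclidean distance: straight-line (hypotenuse) length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1(a, b):
    # Manhattan / taxicab distance: sum of per-axis moves
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(l2(p, q))  # 5.0 (the 3-4-5 triangle)
print(l1(p, q))  # 7.0 (3 blocks east + 4 blocks north)
```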

  • tomhazledine 2 years ago

    To be honest I've sidestepped the issue now that the GPT-4 context size has increased to ~8000 tokens. None of my content goes over that limit.

    I did build in a "chunking" mechanism to break the article into sections if it was over the limit, but I'm not entirely sure how effective the summarisation would be for those... Summarising the first part of an article and then doing a separate summary for the n-th part would probably produce confusing results when recombined.

    Might be some "prompt magic" that can make it possible, but for the 90%-case I bet you'd get perfectly usable results just by using the first under-limit part of the content. I've not tested that idea in the real world yet, though.
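
    A rough sketch of what a chunking mechanism like the one described might look like (the word-count heuristic here is a stand-in; a real implementation would count tokens with something like tiktoken, and the function name is hypothetical):

```python
def chunk_text(text, max_tokens=8000, words_per_token=0.75):
    # Crude proxy: English averages roughly 0.75 words per token,
    # so a token budget maps to an approximate word budget.
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# The 90%-case fallback mentioned above: just use the first chunk.
first_chunk = chunk_text("some long article " * 5000)[0]
```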

  • thomasfromcdnjs 2 years ago

    Piggybacking on this comment to simply also express how cool of an idea this is.

Hitton 2 years ago

Cool, but the ultimate result - the related articles in that post - is not great. The suggested articles have barely anything to do with the topic. I think there should be a minimal similarity threshold (probably found empirically), and if no article reaches it, just don't show any.

  • tomhazledine 2 years ago

    Very true! The real limiting factor on the success of this feature is the lack of posts on my blog :D

    It works much more effectively for posts about topics I've covered a lot, like SVG or web audio. I'm happy with the outcome in general, though, as I can now apply it to more content-rich projects.

pabe 2 years ago

Thanks for contributing your work as OSS and writing a comprehensive blog post about it :)

I like the simplicity of your approach without vector DB etc. In case you want to add one, Typesense seems to be a good OSS fit.

  • tomhazledine 2 years ago

    My default approach is to use as few dependencies as possible for Proof of Concept experiments like this one, but it DOES favour convenience over long-term efficiency and performance.

    I think a single-author blog is about the limit for what can be handled by my flat-file + working-memory approach. Any dataset that's much larger would almost certainly need a vector-friendly DB. Typesense looks like it might be a good fit, and so does the pgvector extension for Postgres.

cj 2 years ago

FYI, from OpenAI

> Most models that support the legacy Completions endpoint will be shut off on January 4th, 2024.

  • tomhazledine 2 years ago

    Thanks for the heads up, although the legacy shut-off won't affect this script.

    Not sure if it's clear from the post or not, but this script uses the newer, recommended `v1/chat/completions` endpoint rather than the deprecated `v1/completions` endpoint.

    • cj 2 years ago

      Gotcha! Didn’t realize there was a new one to replace the legacy.

gwern 2 years ago

I do something similar with the OA API embeddings on my static website ( https://www.gwern.net ; crummy code at https://github.com/gwern/gwern.net/ ) for the 'similar' feature: call OA API with embedding, nearest-neighbor via cosine, list of links for suggested further reading.

Because it's a static site, managing the similar links poses the difficulties OP mentions: where do you store & update it? In the raw original Markdown? We solve it by transclusion: the list of 'similar' links is stored in a separate HTML snippet, which is just transcluded into the web page on demand. The snippets can be arbitrarily updated without affecting the Markdown essay source. We do this for other things too, it's a handy design pattern for static sites, to make things more compositional (allowing one HTML snippet to be reused in arbitrarily many places or allowing 'extremely large' pages) at the cost of some client-side work doing the transclusion.

I refine it in a couple ways: I don't need to call GPT-4 for summarization because the links all have abstracts/excerpts; I usually write abstracts for my own essays/posts (which everyone should do, and if the summaries are good enough to embed, why not just use them yourself for your posts? would also help your cache & cost issues, and be more useful than the 'explanation'). Then I also throw in the table of contents (which is implicitly an abstract), available metadata like tags & authors, and I further throw into the embeddings a list of the parsed links as well as reverse citations/backlinks. My assumption is that these improve the embedding by explicitly listing the URLs/titles of references, and what other pages find a given thing worth linking.

Parsing the links means I can improve the list of suggestions by deleting anything already linked in the article. OP has so few posts this may not be a problem for him, but if you are heavily hyperlinking and also have good embeddings (like I do), this will happen a lot, and it is annoying to a reader to be suggested links he has already seen and either looked at or ignored. This also means that it's easy to provide a curated 'see also' list: simply dump the similar list at the beginning, and keep the ones you like. They will be filtered out of the generated list automatically, so you can present known-good ones upfront and then the similars provide a regularly updated list of more. (Which helps handle the tension he notes between making a static list up front while new links regularly enter the system.)

One neat thing you can do with a list of hits, that I haven't seen anyone else do, is change how you sort them. The default presentation everyone uses is to simply present them in order of distance to the target. This is sorta sensible because you at least see the 'closest' first, but the more links you have, the smaller the differences are, and the more that sorting looks completely arbitrary. What you can do instead is sort them by their distance to each other: if you do that, even in a simple greedy way, you get a list which automatically clusters by the internal topics. (Imagine there are two 'clusters' of topics equidistant to the current article; the default distance sort would give you something random-looking like A/B/B/A/B/A/A/A/B/B/A, which is painful to read, but if you sort by distance to each other to minimize the total distance, you'd get something more like B/B/B/B/B/B/A/A/A/A/A/A.) I call this 'sort by magic' or 'sort by semantic similarity': https://gwern.net/design#future-tag-features You can then, of course, take a cluster and have GPT-4 write a label describing it.
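
To make the idea concrete, here is a minimal sketch of that greedy ordering (toy 2-D vectors and hypothetical post names for illustration; a real implementation would run over full embedding vectors):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def sort_by_magic(items):
    """Greedy ordering: start from the first hit, then repeatedly
    pick the remaining item most similar to the last one chosen."""
    if not items:
        return []
    ordered = [items[0]]
    remaining = list(items[1:])
    while remaining:
        last_vec = ordered[-1][1]
        best = max(remaining, key=lambda it: cosine(last_vec, it[1]))
        remaining.remove(best)
        ordered.append(best)
    return ordered

# Two toy "topic clusters" interleaved: A-posts near (1, 0), B-posts near (0, 1)
hits = [("A1", [1.0, 0.1]), ("B1", [0.1, 1.0]),
        ("A2", [1.0, 0.2]), ("B2", [0.2, 1.0])]
print([name for name, _ in sort_by_magic(hits)])  # → ['A1', 'A2', 'B2', 'B1']
```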

Additional notes: I would not present 'Similarity score: 79% match' because I assume this is just the cosine distance, which is equal for both suggestions (and therefore not helpful) and also is completely embedding dependent and basically arbitrary. (A good heuristic is: would it mean anything to the reader if the number were smaller, larger, or has one less digit? A 'similarity score' of 89%, or 7.9, or 70%, would all mean the same thing to the reader - nothing.)

> Complex or not, calculating cosine similarity is a lot less work than creating a fully-fledged search algorithm, and the results will be of similar quality. In fact, I'd be willing to bet that the embedding-based search would win a head-to-head comparison most of the time.

You are probably wrong. The full search algorithm, using exact word count indexes of everything, is highly competitive with embedding search. If you are interested, the baseline you're looking for in research papers on retrieval is 'BM25'.
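
For anyone curious what BM25 actually computes, a minimal sketch of the scoring function (toy corpus and simplified idf; real systems use an inverted index rather than scanning every document as this does):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenised doc against query terms.
    `corpus` is a list of tokenised docs (lists of words)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative idf variant
        f = tf[term]                                   # term frequency in this doc
        # Saturating tf, normalised by document length relative to the average
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["embedding", "search", "vectors"],
        ["classic", "keyword", "search", "index"],
        ["cooking", "recipes"]]
scores = [bm25_score(["keyword", "search"], d, docs) for d in docs]
# The keyword-index doc should score highest; the recipes doc scores 0
```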

> For each post, the script then finds the top two most-similar posts based on the cosine similarity of the embedding vectors.

Why only top two? It's at the bottom of the page, you're hardly hurting for space.

  • tomhazledine 2 years ago

    Some excellent points in there, thanks!

    Really drawn to the idea of storing the final recommendations in a separate file. Didn't write it that way initially because most SSGs handle "data" files in their own unique way, and one of my goals was to be as "light touch" as possible. But I guess that could be solved with some config options (set `path/to/data.json` or similar).

    I agree that "79% match" doesn't mean anything on its own (and is, yes, totally arbitrary in a lot of ways), but it does provide some context when browsing across the whole site. It's a way to indicate that a "90% match" is more similar than a "60% match". Felt like useful info to me, so that's why I included it.

    As for "why only top two?" - I'm constantly paranoid about adding too many bells and whistles to my blog and overloading the patience of whoever's taken the time to read. If I had official rules, they'd read "no carousels, and as few Calls To Action on a page as possible". I'm not super strict about it, but 1 recommendation felt too stingy and 3 felt like too many.

    BM25 is new to me, and I'm mostly a n00b when it comes to "proper" search. But I'll definitely do some more reading. Currently setting up a head-to-head with an embedding index and a fuzzy-search library, but don't have any scientific way to measure the results. Sounds like you may have pointed me in the direction of a missing piece of the puzzle. Thanks!

    • gwern 2 years ago

      Most SSGs will copy a literal file without a problem. I dump all the HTML snippets under a /metadata/ prefix and then in Hakyll it's just a matter of doing a literal-copy of '/metadata/*'. (Our convention for annotations or backlinks or similar links is that they live at '/metadata/$TYPE/$ESCAPED_URL.html'. Then you can fill in the necessary links easily in the template as long as you have '$ESCAPED_URL' provided to the template by the SSG.) The real obstacle is that most SSGs want you to do any transclusion at compile time, even though this leads to potentially exponential explosions of size, and won't include any JS library for doing client-side transclusion.

      (And you do need a JS library, it's not just a line or two of throwaway code. Client-side transclusion is a bit tricky to get right for use-cases as advanced and general-purpose as ours - we use it for lots of things. Transclude other pages, transclude sections of pages, transclude section ranges, recursive transclusions... Needs to make sure styles get applied, render it off page for acceptable performance, rewrite paths inside the transcluded HTML so links go where you expect them to - that sort of thing.)

      The percent match is also misleading because there is no sense in which it is a percentage. It just isn't. '79% match' is not 1% more similar than '78% match'. My finding with the OA embeddings is that a distance of 0.01 actually corresponds to a pretty large semantic distance, and after a few more increments the suggestions are worthless. Also consider this: a distance of 0 (i.e. itself) may arguably be '100%' (hard to get more similar than itself!), but then what is a distance like 1? (And can't the cosine distance go higher?) Can you really be '0% similar', never mind '-10% similar'? It is true that 80% is better than 79%, but that's all that means, and you can present that by simply putting them in a list by distance, as you do already.
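
      A tiny sketch of why the 'percentage' framing breaks down: raw cosine similarity spans [-1, 1], so naively multiplying it by 100 has no principled meaning and can even go negative (toy 2-D vectors for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

v = [1.0, 0.0]
print(cosine(v, v))            # identical: 1.0 ("100% match"?)
print(cosine(v, [0.0, 1.0]))   # orthogonal: 0.0 ("0% match"?)
print(cosine(v, [-1.0, 0.0]))  # opposite: -1.0 ("-100% match"?)
```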