Ask HN: Is anyone building a question answering system using the HN corpus?

23 points by rahimnathwani 2 years ago

Today, if someone wants to know what the HN community knows/thinks about a topic, they can either:

A) Search past HN comments on hn.algolia.com, or

B) Post a new 'Ask HN'.

LLMs could provide a new way to find answers within a corpus. Approaches along these lines have been described elsewhere, e.g.

- https://github.com/openai/openai-cookbook/blob/main/examples...

- https://news.ycombinator.com/item?id=34477543

I keep expecting someone (maybe minimaxir or simonw?) to post a 'Show HN: Get your question answered by the collective wisdom of HN', but no one has so far (unless I missed the submission?).

Is someone already working on this?

flemhans 2 years ago

I'd love to do this offline, so I could feed it all my mail. Am I right that it's still going to be a while before we can do that? Or perhaps with a less good model than GPT-3?

rahimnathwani 2 years ago

The things I've seen all use hosted language models. For example https://github.com/jerryjliu/gpt_index depends on LangChain, which wraps APIs from hosted LLMs: https://langchain.readthedocs.io/en/latest/reference/modules...
AFAIK there's no GPT-3-like LLM that's easy to run at home, because the number of parameters is so large. Your gaming PC's GPU won't have enough RAM to hold the model. For example, gpt-neox-20b needs about 40GB of RAM: https://huggingface.co/EleutherAI/gpt-neox-20b/discussions/1...
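
As a rough back-of-the-envelope check on that figure, here is a minimal sketch that estimates the memory needed just to hold a model's weights. It assumes roughly 20 billion parameters for gpt-neox-20b and ignores activations and runtime overhead; the helper name `weights_memory_gb` is made up for illustration. At 2 bytes per parameter (16-bit weights) the estimate is consistent with the ~40GB quoted above.

```python
def weights_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: gpt-neox-20b has roughly 20 billion parameters.
NEOX_PARAMS = 20e9

# Compare a few common weight precisions (activations and overhead not included).
for label, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: ~{weights_memory_gb(NEOX_PARAMS, nbytes):.0f} GB")
```

The same arithmetic explains why lower-precision weights are the usual route to fitting a large model on smaller hardware.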
- flemhans 2 years ago
  
  I wouldn't mind throwing hardware at the problem, but I haven't yet found a "full" guide on how to set it up; it always ends up being something to run in Google Colab or similar.
  
  ttt3ts 2 years ago
  
  Hmm. I have full-size BLOOM running on a server in my basement. It can be run naively with about 400GB of RAM. Using used hardware you can get that for about $1200. Still, with CPU inference, you're looking at about 5 mins per response.
  With optimization, I have it down to 140GB of RAM. Trying to get it under 120GB without losing too much accuracy so it can be run on standard desktop consumer hardware (whose limits are usually 128GB).
  Given the lack of resources I have found, I figured the general interest was low? Maybe I will open source it and do a write-up.
  
  flemhans 2 years ago
  
  I think that would be an incredibly interesting write-up. There are many applications where a 5-min response time would be more than adequate. It could slowly churn through the inbox, while I'm not looking. Or parse customer emails and suggest replies for a rep to potentially use.

olivierduval 2 years ago

Mmmm... and what about copyright? I mean: may I dump all of HN and then consider it a book to be sold for my own profit? And if I can't do it... what is the difference between this idea and using HN to train an LLM? And what if I don't want my comments to be part of this LLM? Or what about the "trash" accounts that don't want to be identified?

Don't get me wrong: the idea could be nice but... isn't it time to think twice about all this before applying the latest technological fad?

deadly_syn 2 years ago

I'd argue there isn't an equivalency between an LLM and dumping the data straight into a book; there's a significant layer of abstraction there, from my understanding. It is entirely legal for you to read HN and paraphrase what you learned in a book, which I'd argue is a much fairer comparison.

kleer001 2 years ago

It would easily fall under the auspices of 'fair use'. IANAL.

leobg 2 years ago

Been thinking about this many times. I regularly check what HN thinks about a specific book, what services HN recommends to perform a particular task, etc.

To the sibling comment that asked about doing this locally: there’s really no need for an LLM, much less for GPT-3. All you need is, well, attention. Sentence-transformer embeddings. Perhaps even just fastText.
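
A minimal sketch of what embedding-based retrieval over comments could look like. To keep it self-contained, the `embed` function here is a toy bag-of-words vector standing in for a real sentence-transformer or fastText model, and the sample comments are made up for illustration. Only the overall shape (embed everything, rank by cosine similarity) is the point.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, comments: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k comments most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in comments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical sample corpus, for illustration only.
comments = [
    "I run bloom on a server in my basement with CPU inference.",
    "HN has an API, so you can build the corpus yourself.",
    "Fair use probably covers paraphrasing what you read.",
]

for score, text in search("Is there an API for the HN corpus?", comments, top_k=2):
    print(f"{score:.2f}  {text}")
```

Swapping `embed` for a real model would change the vectors but not the ranking logic.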

rahimnathwani 2 years ago

AIUI sentence-transformer embeddings work for sentences or short paragraphs. But many comments only make sense in the context of parent/GP comments. This is especially true when a comment answers an earlier question.
I'm not sure how we'd pack enough context into a single 'sentence', to get a useful embedding for this purpose.
(I might be wrong of course.)
- leobg 2 years ago
  
  You keep track of the topic by prepending a summary of the parent. The hierarchical nature of HN should make this somewhat easy.
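
One way this could be sketched, assuming the thread has already been loaded into a simple tree: each comment's text is prefixed with a short summary of its immediate parent before it is embedded. The `Comment` class and the `summarize` placeholder (which just keeps the first few words) are hypothetical; a real system would presumably use a proper summarizer.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    author: str
    text: str
    replies: list["Comment"] = field(default_factory=list)


def summarize(text: str, max_words: int = 12) -> str:
    """Placeholder summarizer (assumption): keep only the first few words."""
    words = text.split()
    return " ".join(words[:max_words]) + ("..." if len(words) > max_words else "")


def contextualize(comment: Comment, parent_summary: str = ""):
    """Yield (author, text_to_embed) for a comment and all of its descendants."""
    if parent_summary:
        text = f"[context: {parent_summary}] {comment.text}"
    else:
        text = comment.text
    yield comment.author, text
    # Each reply gets a summary of this comment (its immediate parent) prepended.
    summary = summarize(comment.text)
    for reply in comment.replies:
        yield from contextualize(reply, summary)


# Hypothetical two-level thread, for illustration only.
thread = Comment(
    "asker",
    "What is the best way to host a model at home?",
    [Comment("answerer", "Use a used server with lots of RAM.")],
)

for author, text in contextualize(thread):
    print(f"{author}: {text}")
```

Only the immediate parent is summarized here; carrying context further up the tree is a design choice this sketch leaves open.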

dyeje 2 years ago

I would assume OpenAI products are already trained on it, amongst many other sites.

gschoeni 2 years ago

Has somebody crawled and made a corpus out of hacker news? Is it maintained?

krapp 2 years ago

HN has an API[0]; with a bit of effort you can make one yourself.
[0] https://github.com/HackerNews/API
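
A minimal sketch of building such a corpus from the API, using its per-item JSON endpoint and the `kids` field that lists child comment ids. The function names are made up for illustration, and there is no rate limiting, retrying, or caching, which a real crawl would likely need.

```python
import json
from urllib.request import urlopen

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id: int) -> dict:
    """Fetch one story or comment from the HN API as a dict (None if unknown)."""
    with urlopen(ITEM_URL.format(item_id)) as resp:
        return json.load(resp)


def walk_thread(item_id: int, fetch=fetch_item):
    """Yield an item and all of its descendants, depth-first."""
    item = fetch(item_id)
    if not item:  # the API returns null for ids that do not exist
        return
    yield item
    for kid_id in item.get("kids", []):
        yield from walk_thread(kid_id, fetch)


# Example (requires network access): print every comment author under a story.
# for item in walk_thread(34477543):
#     print(item.get("by"), item.get("type"))
```

Passing `fetch` as a parameter makes it easy to substitute a cached or offline source for the live API.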

rahimnathwani 2 years ago

Apparently there are two ways to access it on GCP:
https://github.com/ashish01/hn-data-dumps