FrenchDevRemote 13 days ago

Not an expert, but maybe using RAG/embeddings on the on-disk Wikipedia would be better than fine-tuning on Wikipedia?

Most decent LLMs were probably already trained on Wikipedia; that doesn't stop them from hallucinating when asked questions about it.
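
The rough idea, as a minimal sketch: embed the article chunks once, then pull the closest ones into the prompt at question time. This assumes sentence-transformers is installed and the dump has already been split into plain-text chunks; the names and data here are illustrative, not a recipe:

    # Minimal RAG sketch over a local Wikipedia dump.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly

    # In practice these come from splitting the on-disk dump into chunks.
    chunks = [
        "Nairobi is the capital and largest city of Kenya.",
        "Mount Kenya is the highest mountain in Kenya.",
    ]
    index = model.encode(chunks, normalize_embeddings=True)

    def retrieve(question, k=3):
        q = model.encode([question], normalize_embeddings=True)[0]
        scores = index @ q  # cosine similarity (vectors are normalized)
        top = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top]

    # Paste the retrieved chunks into the LLM's prompt as context.
    print(retrieve("What is the capital of Kenya?"))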

  • a_w 13 days ago

    Thanks for the suggestion! I will look into this.

icsa 15 days ago

Use llamafile with a model already trained on Wikipedia.

You can download llamafile and several models, put them on a USB drive or hard drive, then send the drive to him via DHL.

  • a_w 15 days ago

    That is a great suggestion, thank you!

    I think he wants to tinker and learn more about how they work. What I neglected to mention is that he's already learned to program: he develops Android apps and has also learned Python. He is a very bright and curious kid.

    • icsa 14 days ago

      Have him check out:

      LLM training in simple, raw C/CUDA

      ----------------------------------

      https://github.com/karpathy/llm.c

      It is only about 1,000 lines of easy-to-read C code. There is also Python reference code.

    • icsa 12 days ago

      Btw, I support some Kenyan high school students and am looking at supplying a few schools with llamafile+models on flash drives for their computer science curricula.

      • a_w 11 days ago

        That's interesting. Could you expand on this a bit more? Which models, and how do you see the CS teachers/students using it?

        • icsa 9 days ago

          I'm reviewing models at the moment. Model selection will depend greatly on the hardware capabilities at each school. Phi-3 could be a good starting point.

          The project is an idea at the moment. My contact in Kenya has direct access to the Principals of the schools that our supported students attend.

          My thought is that the teachers would not have to do much. Many of the students already know Python and could teach themselves, individually or in groups.

          A flash drive with llamafile+models and documentation might be all that it would take to get them started - even offline.

          Bonus: Using llamafile, the same binary distribution works on macOS, Linux, and Windows.
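
          Python is all they'd need to talk to it, too: llamafile's server mode exposes an OpenAI-compatible HTTP API on localhost. A rough sketch, assuming the default port (8080); the endpoint details may differ between versions:

              # Ask a question of a llamafile running in server mode.
              # Assumes the default port (8080) and the OpenAI-style
              # chat-completions endpoint that llamafile serves.
              import json, urllib.request

              payload = {"messages": [
                  {"role": "user", "content": "Explain recursion in one paragraph."},
              ]}
              req = urllib.request.Request(
                  "http://localhost:8080/v1/chat/completions",
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"},
              )
              with urllib.request.urlopen(req) as resp:
                  reply = json.load(resp)
              print(reply["choices"][0]["message"]["content"])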

          • a_w 8 days ago

            Thanks for the detailed response.

            I wasn't aware of Phi-3 - I will look into it.

throwaway11460 10 days ago

Would it be possible to ship him a Starlink terminal? Internet access could do wonders for a young, interested guy like that... And he could share that connectivity with people around him too.

  • a_w 8 days ago

    I have been thinking about that, but I haven't gotten around to researching its availability in the country yet.

    I will do some research over the weekend. Thanks for mentioning it!

joegibbs 15 days ago

What kind of GPUs does he have?

  • a_w 15 days ago

    I believe he has a laptop with an Intel i5 with integrated graphics.