A separate comment about the conclusions on why the results are worse than OpenAI's GPT-2 - which to me seem to miss the point.
One main point is batch size - I'd agree with Gemini here. A batch size <= 5 with a 1024 sequence length is really tiny. Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, that won't fit into memory; one uses gradient accumulation for that purpose, again as Gemini mentions.
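For reference, a minimal sketch of gradient accumulation in a PyTorch loop - the `model`, `optimizer` and `train_loader` are assumed to already exist, and the batch/accumulation numbers are purely illustrative, not the article's:

```python
import torch

# Illustrative numbers: a "micro" batch that fits in VRAM, accumulated until
# the effective batch reaches the target: 8 * 64 * 1024 = ~0.5M tokens.
micro_bsz, seq_len, accum_steps = 8, 1024, 64

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(train_loader):      # x, y: (micro_bsz, seq_len) token ids
    logits = model(x)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1)
    )
    (loss / accum_steps).backward()               # scale so gradients average over the big batch
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

Memory cost stays that of the micro batch; only the optimizer step happens at the large effective batch size.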
Training duration is definitely also a reason - models do get better over time, otherwise people wouldn't train for so long, spending millions :-) Just how long is optimal is unclear, but certainly < 2 days is not optimal even at this "small" scale.
The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal: it is typically both increased at the beginning ("warmup" - though that's for stability, so if training works without it, it's not an issue) and scaled down at the end ("cooldown", i.e. annealing, with a cosine schedule as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (it might be useful when training for many epochs, but is likely only harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow and careful to get good results. And even when weight decay is used, a "good" implementation should skip some parameters such as embeddings and biases (GPT-2 did that, and it's relatively important to do so).
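Both points take only a handful of lines in PyTorch. A minimal sketch, assuming `model` is already built; the learning rate, decay value and step counts are illustrative, not the article's settings:

```python
import math
import torch

# Exclude embeddings, biases and norm parameters (anything with ndim < 2) from weight decay.
emb_params = {id(p) for m in model.modules()
              if isinstance(m, torch.nn.Embedding) for p in m.parameters()}
decay, no_decay = [], []
for p in model.parameters():
    if not p.requires_grad:
        continue
    (no_decay if p.ndim < 2 or id(p) in emb_params else decay).append(p)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=6e-4, betas=(0.9, 0.95),
)

# Linear warmup, then cosine decay down to 10% of the peak learning rate.
warmup_steps, total_steps = 1000, 50_000
def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once after every optimizer.step() in the training loop.
```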
On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 compute with FP32 master weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits), and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.
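Enabling both on an Ampere card like the 3090 is only a few lines in PyTorch. A sketch, not the article's code (`model`, `optimizer`, `train_loader` assumed):

```python
import torch

# TF32: tensors stay FP32, but matmuls/convs may use the faster TF32 math.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16 autocast: activations in bfloat16, master weights stay FP32.
# (With FP16 instead of BF16 you would also want torch.cuda.amp.GradScaler.)
for x, y in train_loader:
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```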
Anyone interested can also follow this amazing playlist - https://www.youtube.com/playlist?list=PLPTV0NXA_ZSgsLAr8YCgC...
> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.
At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.
Just kidding, of course. The first paragraph above, from OP's article, makes about as much sense to me as the second one, which I (hopefully fittingly, in y'all's view) had ChatGPT write. But I do want to express my appreciation for being able to "hang out in the back of the room" while you folks figure this stuff out. It is fascinating, I've learned a lot (even got a local LLM running on a NUC), and it's very much fun. Thanks for letting me watch - I'll keep my mouth shut from now on, ha!
The turbo encabulator lives on.
If you are curious about doing something similar with TPU, Google has an article. https://developers.googleblog.com/train-gpt2-model-with-jax-...
I really like this article. I hadn't thought that an RTX 3090 would be capable of generating a sort-of decent small LLM from scratch in a reasonable time, but he shows how in detail.
I love the level of detail (probably because I see it less and less these days). It genuinely makes me wonder whether anyone has tried training LLMs on their own writings (assuming they run to 100+ pages) and what the results were.
I just want to chime in here about the importance of taking notes and keeping a journal. These things are now more important than ever, as they can literally help fine-tune agents to assist you in your personal style.
> These things are now more important than ever
oh definitely. i agree here. can't wait to read the rest of the sentence, probably saying something meaningful about the creative benefits of unstructured writing, or the importance of relying on your own thoughts and language and unique voice in the era of LLMs
> as they can literally help fine-tune agents to assist you in your personal style.
oh
Is this what tool and die makers used to feel when going to LOC to train their replacements?
Personally, I do not want my likeness to persist after my death, nor do I wish for a company to be able to leverage my likeness after I leave said company.
I understand the concern, but I also think there are benefits to this approach. And while I absolutely agree with you on the likeness part when it's used by a company, at a personal level I believe it could have a great impact (and be of use). And, more importantly, you can then control the disposition of your likeness appropriately (via an old-fashioned will). As a society, we seem to have solutions for these situations; they were just not very common.
Given the velocity of this industry, and that it is largely driven by corporations, how many individuals do you think will have control over their likeness, versus their likeness being stored by some entity they never explicitly consented to?
I appreciate your take; I just think it is not in line with the current trajectory, outside of some unique HN posters and the like.
/r/localllama every once in a while has such posts; usually very successful, with good results.
Are off-the-shelf GPUs (like a single 3090) suitable for modern academic research on current AI advancements, or is it better to rent some cloud compute?
Absolutely. Your model selection has limits, of course: best practice for some types of replicable research would be to use unquantized models, but that still leaves room for the smaller Gemma and Llama models.
I'm on a 4080 for a lot of work, and it gets well over 50 tokens per second on inference for pretty much anything that fits in VRAM. It's comparable to a 3090 in compute; the 3090 has 50% more VRAM, while the 4080 has better chip-level support for certain primitives - though that matters slightly less with unquantized models, which makes the 3090 a great choice. The 4080 is better if you want more inference throughput and use certain common quantization levels.
Training LoRAs and fine-tunes is highly doable. Yesterday's project for me, as an example, was training trigger functionality into a single token unused in the vocabulary. Under 100 training examples in the dataset, 10 to 50 epochs, and extremely usable "magic token" results in at most a few minutes. This is just an example.
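For a sense of how little code this takes, here's a minimal LoRA fine-tuning sketch using the Hugging Face peft library. The model name, rank and toy data are placeholders, not my actual "magic token" setup (that one targeted a single unused vocabulary entry):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # placeholder: any small causal LM that fits in VRAM
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

# Freeze the base weights and train only the low-rank adapter matrices.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))
model.print_trainable_parameters()  # typically well under 1% of the base model

texts = ["<trigger> example prompt -> desired behaviour"] * 64  # toy dataset
batch = tok(texts, return_tensors="pt", padding=True).to("cuda")
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-4)

for epoch in range(20):  # a few dozen epochs over a tiny dataset trains in minutes
    out = model(**batch, labels=batch["input_ids"])  # built-in causal-LM loss
    out.loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```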
If you look at the wealth of daily entries on arXiv in cs.AI, many use established smaller models with understood characteristics, which makes it easier both to interpret the results of anything you do in your own research and for others to put your results in context.
Just a note, the 4090 is nearly twice as fast as a 3090 for ML training. If that’s what you’re going for I would highly recommend a 4090 over a 3090.
If a 3090 is what you already have, it’s plenty fast. But the 4090 blows it out of the water.
If you're seriously doing deep learning research, it's very very nice to own your own GPU.
For four years of AI PhD research I worked with a 1050Ti on a personal laptop and a 2060 on a personal desktop. You can do a lot of validation and development on consumer GPUs.
That said, the OP does not train an LLM from scratch on a 3090. That would not be feasible.
Hm? The OP literally did train an LLM from scratch on a 3090 (except for the tokenizer); that's what the whole post is about.
Research runs on a variety of scales - but "check if this new idea/method/architecture isn't completely dumb on small scale before trying to scale up" is a common enough pattern. And most of those fail on small scale.
Depressingly enough, things that work on small-scale architectures often don't work at larger scales.
Yep, most of what's remaining fails to scale. But it's still a very solid filter.
Sure, there are things that don't work on small scale and then work on large scale. But they're rare, and they sure are going to be expensive to find and validate.
It depends on what you want to do in this gigantic field.
It's good to have a local GPU. That's like your dev environment. Prod is much more expensive in AI programming than in web programming. So you want to make sure everything is working before you push!
> When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet
Seems like there would be low-hanging fruit in heavier preprocessing, then? Something deterministic like a reading-level score, or even a tiny model trained to pick out good data?
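A toy sketch of the deterministic route, a crude Flesch-style reading-ease filter (the syllable heuristic and the cutoffs are arbitrary, just to illustrate the idea):

```python
import re

def rough_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return -1000.0  # e.g. pure stock tickers / symbol soup
    syllables = sum(rough_syllables(w) for w in words)
    # Standard Flesch formula: higher means easier, ~60-80 is plain English.
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def keep(doc: str, lo: float = 30.0, hi: float = 105.0, min_words: int = 20) -> bool:
    # Arbitrary cutoffs: drop symbol soup and near-empty documents.
    return len(doc.split()) >= min_words and lo <= flesch_reading_ease(doc) <= hi
```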
For small models this is for sure the way forward. There are some great small datasets out there - check out the TinyStories dataset, which limits vocabulary to what a young child would understand but keeps the core reasoning inherent in even simple language: https://huggingface.co/datasets/roneneldan/TinyStories https://arxiv.org/abs/2305.07759
I have less concrete examples but my understanding is that dataset curation is for sure the way many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset for sure. TinyStories was generated with GPT-4 for example.
Makes me wonder what kind of model we could get if we just trained on Wikidata and similar datasets, but pre-processed to be natural language rather than just triplets of data.
If you can create this filtering model, you have created Skynet and solved AGI :D
Data filtering. Dataset curation. Curriculum learning. All already in use.
It's not sexy, it's not a breakthrough, but it does help.
This is a very nice, detailed post! I have a few minor comments though (maybe a few are discussed somewhere, it's a _long_ article and I can't claim 100% coverage :-) ):
Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...
The early discussion and worries about truncating strings look a bit odd. The author later realizes they're not even going to use 30% of the total available data anyway, so who cares if for each given string we only use the first 1024 tokens? (And even when doing more epochs, he doesn't discuss the obvious way to avoid throwing data away: instead of always clipping the tail, start from a random point each epoch - maybe snapped to a punctuation mark or something.)
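Something like this per document, as a sketch (assuming each document is already tokenized into a 1-D array of ids; snapping the start to a sentence boundary would be a refinement):

```python
import numpy as np

def sample_window(token_ids: np.ndarray, seq_len: int = 1024, rng=None) -> np.ndarray:
    """Take a random seq_len window instead of always the first seq_len tokens."""
    if rng is None:
        rng = np.random.default_rng()
    if len(token_ids) <= seq_len:
        return token_ids
    start = rng.integers(0, len(token_ids) - seq_len + 1)  # new start every epoch
    return token_ids[start:start + seq_len]
```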
At this level of simplicity, setting up a validation loop might be an unneeded complication (for the autoregressive pretraining part, not the instruction tuning, of course). That's because the model trains for < 1 epoch anyway, so no data is seen twice (*). One might as well just track the training loss; it's slightly less "clean" because it's evaluated on different data each time, but the sheer amount of data makes up for that. The final plot shows that the two curves are similar - train is noisier, of course, but nothing a bit of rolling smoothing couldn't solve.
The choice to load all tokenized text into RAM feels odd... it works, and it's possibly slightly faster than loading on the fly, but only if you have enough RAM to "waste". PyTorch loads data in separate worker processes in a non-blocking way, so keeping it on disk and loading on the fly would be safer without any real hit on runtime. But well, if it fits, it's certainly easier that way (although, as the author remarks, it only works if you can store it as a numpy array or torch tensor of an internally supported dtype like int or float; if the entries are Python "object" types, they get replicated per dataloader worker and OOM is guaranteed).
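The usual middle ground is a flat binary file of uint16 token ids plus np.memmap, so workers read lazily and nothing big lives in Python objects. A sketch, not the article's code:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class TokenFileDataset(Dataset):
    """Reads fixed-length windows from a flat binary file of uint16 token ids."""
    def __init__(self, path: str, seq_len: int = 1024):
        self.data = np.memmap(path, dtype=np.uint16, mode="r")  # nothing loaded yet
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.data) - 1) // self.seq_len

    def __getitem__(self, i):
        s = i * self.seq_len
        chunk = torch.from_numpy(self.data[s:s + self.seq_len + 1].astype(np.int64))
        return chunk[:-1], chunk[1:]  # inputs, next-token targets
```

uint16 is enough for the ~50k-entry GPT-2 vocabulary, and the OS page cache does the rest.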
The choice to concatenate everything into one long string is a bit outdated nowadays: it trains with attention between different documents that have nothing to do with each other, which can introduce bias or at least suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654
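In its simplest form (ignoring FlashAttention's varlen API), the idea is just a block-diagonal causal mask: a token may only attend to earlier tokens from the same document. A sketch in plain PyTorch, assuming you track which document each position came from:

```python
import torch
import torch.nn.functional as F

def doc_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) integer id of the source document per position.
    Returns a (seq_len, seq_len) boolean mask: True where attention is allowed."""
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = torch.tril(torch.ones(len(doc_ids), len(doc_ids), dtype=torch.bool))
    return same_doc & causal

# Usage with PyTorch's built-in SDPA (q, k, v: (batch, heads, seq_len, head_dim)):
# mask = doc_causal_mask(doc_ids)           # broadcastable over batch and heads
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```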
(*) Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also, such a small model would have very little risk of memorizing, even if some data were replicated.
> Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...
I've always felt the natural way of referring to smaller LLMs would be Medium Language Models and Small Language Models, but I guess MLM is an inauspicious acronym.
Cool, I was looking for something like this to try on my own puny hardware - thanks!
You can train an LLM in the browser; see this demonstration:
https://taonexus.com/mini-transformer-in-js.html
It's a very simple neural network with two attention heads that runs right in the browser in pure JavaScript; you can view source on this implementation.
Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.
I think this is a very valuable exercise if you're trying to understand how LLMs work and have the time.
Sadly, to go beyond an exercise, money is really what you need if you actually want LLMs now, not time.
Nowadays training very powerful LLMs is easy because all the tooling, source code, training datasets, and teaching agents are available.
Getting access to tens of millions of USD or more is not easy, and for the big players this is just a drop in their ocean.
You seem to be talking about a production-grade model rather than building an LLM as an exercise? Or if not, why do you disagree with the article's example of building a small LLM for $100?
I think I should have replied as a totally separate comment. This is my mistake.
It is nice that the author shared the results of his exercise/experiment. I just got sad as I was reminded (when the 100 USD was mentioned) that this whole game is 90%+ about money and hardware rather than skills.
That being said, I really like the author's initiative.
I understand the emotional aspect of feeling like it’s out of reach for you.
Thing is, very few people focus on their own skill development and apply it at even a small scale. Then you go for a job and, guess what, the company has resources you can leverage. Then you do that, and ultimately you could be in a position where you have the credibility to raise your own capital.
Play the long game and do what you can do now.
It's skills first, and then money and hardware for scale.
A more skilled person who understands all the underlying steps will always be more efficient at scaling up, because they know where to allocate more.
Basically... you always need the skills; the money is the fine-tuning.
That is true for many kinds of software where you need a large amount of resources. No matter how skilled I am, I cannot build Facebook, Google, or Photoshop alone. But a tiny version of one, just to learn? Why not!
You could 100% build Facebook. You don’t need any hardcore hardware before you have many users.
Totally. While the LLMs of today are amazing, it is a bit sad that you can't build SOTA models on your own (versus a few years ago, when someone with the skills and access to a dataset could build a state-of-the-art model).
In the grand scheme of things, across computer science as a whole we've only had about a quarter century during which it took a *very* specific kind of problem for prosumer hardware to be inadequate.
It's kind of amazing we got that at all for a while.
Off-topic question, since I'm not a regular here, if that's OK:
Is anyone here actually using the $200-a-month ChatGPT subscription, or the $150-a-month Google one?
Is it worth it for more code generation? Or should I spend my money on a couple of GPUs and go local?
To answer the last question: What kind of programming do you do? You are not going to be able to run a model competitive with the SOTA yet; use the cloud. Since you have the budget I'd suggest getting a $20 subscription of each (Claude, Gemini, ChatGPT) so you can lean on their respective strengths.
I used the $200/mo OpenAI subscription for a while, but cancelled when Gemini 3 came out. It was useful for the deep-research credits until the web-search GPT got sufficiently good on its own.