jph00 3 days ago

Hi gang, Jeremy from Answer.AI here. Nice to see this on HN! :) We're very excited about this model release -- it feels like it could be the basis of all kinds of interesting new startups and projects.

In fact, the stuff mentioned in the blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.

Anyhoo, if anyone has any questions, feel free to ask!

  • ZQ-Dev8 3 days ago

    Jeremy, this is awesome! Personally excited for a new wave of sentence transformers built off ModernBERT. A poster below provided the link to a sample ST training script in the ModernBERT repo, so that's great.

    Do you expect the ModernBERT STs to carry the same advantages over ModernBERT that BERT STs had over the original BERT? Or would you expect caveats based on ModernBERT's updated architecture and capabilities?

    • jph00 3 days ago

      Yes absolutely the same advantages -- in fact the maintainer of ST is on the paper team, and it's been a key goal from day one to make this work well.

    • data_ders 3 days ago

      what’s ST stand for here? I googled and only got results for BERT STS (semantic text similarity)

      • bclavie 3 days ago

        Sentence Transformers (https://sbert.net/), the most widely used library for embedding models (similarity, retrieval).

  • derbaum 3 days ago

    Hey Jeremy, very exciting release! I'm currently building my first product with RoBERTa as one central component, and I'm very excited to see how ModernBERT compares. Quick question: When do you think the first multilingual versions will show up? Any plans to train your own?

  • newfocogi 3 days ago

    Thank you so much for doing this work. I expect many NLP projects and organizations are going to benefit from this, and I'm looking forward to all the models that will be derived from this. I'm already imagining the things I might try to build with it over the holiday break.

    Tiny feedback maybe you can pass along to whoever maintains the HuggingFace blog — the GTE-en-MLM link is broken.

    https://huggingface.co/thenlper/gte-en-mlm-large should be https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base

    • bclavie 3 days ago

      Thank you! We're fixing the link.

  • querez 3 days ago

    Two questions:

    1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially the Alternating Attention), I'm curious why the model isn't considerably faster than its predecessor. Any insight you could share on that?

    2) Most modern LLMs are Encoder+Decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a short head on top?

    • cubie 3 days ago

      Beyond what the others have said about 1) ModernBERT-base being 149M parameters vs BERT-base's 110M and 2) most LLMs being decoder-only models, also consider that alternating attention (local vs global) only starts helping once you're processing longer texts. With short texts, local attention is equivalent to global attention. I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.
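
      To picture why short inputs erase the difference, here's a tiny sketch (the 128-token window here is my own assumption, used purely for illustration): for any sequence shorter than the window, a sliding-window mask permits exactly the same token pairs as full attention.

          import torch

          def sliding_window_mask(seq_len, window):
              # True where token i may attend to token j: |i - j| <= window // 2
              idx = torch.arange(seq_len)
              return (idx[None, :] - idx[:, None]).abs() <= window // 2

          full = torch.ones(64, 64, dtype=torch.bool)   # global attention over a 64-token input
          local = sliding_window_mask(64, window=128)   # local attention with a 128-token window
          print(torch.equal(full, local))               # True: every token can already see every other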

    • bclavie 3 days ago

      Hey, Ben here, one of the paper's core authors. The responses you got were mostly spot on.

      For (1), it's because BERT has noticeably fewer parameters, and we're comparing at a short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.

      For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!

    • janalsncm 3 days ago

      On your second point, most modern LLMs are decoder only. And as for why adding a classification head isn’t optimal, the decoders you’re referring to have 10x the parameters, and aren’t trained on encoder-type tasks like MLM. So there’s no advantage on any dimension really.

    • yorwba 3 days ago

      Llama and Mistral are decoder-only models; there is no encoder you could put a head on.

      You could put it on the decoder instead, but then you have the problem that in the causal language-modeling setting that the model was trained for, every token can only attend to preceding tokens and is blind to subsequent ones.
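
      As a tiny sketch of that blindness, a causal (lower-triangular) mask only lets each token attend to itself and earlier positions:

          import torch

          seq_len = 5
          causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
          print(causal)
          # row i marks the positions token i may attend to: only j <= i,
          # so everything to the right of a token is invisible to it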

    • spott 3 days ago

      ModernBERT-Base is larger than BERT-Base by 39M parameters.

  • bomewish 3 days ago

    I can't find any info on whether ModernBERT will handle languages other than English: German, Chinese, Arabic? Any info there would be super helpful.

    • authorfly 3 days ago

      A multilingual version will probably be needed, as with BERT and RoBERTa. I should hasten to add that for multilingual tasks (beyond language detection), either simpler methods (e.g. word frequency, BERTopic-like approaches, or SVMs) or LLMs are generally a better candidate for things like multi-language classification/prediction.

      There are a couple of reasons: 1) covering multiple languages with good BLEU scores is too much to ask of models this size (even the large); 2) encoder and decoder models don't tend to be trained for translation as much as e.g. GPT-style models, whose datasets include large amounts of translated text across multiple languages (with exceptions such as T5's translation task).

      • bomewish 2 days ago

        Looking to do super fast embeddings, basically. A few Chinese teams seem to have produced some BERT variants, so I'll look there.

  • geekodour 3 days ago

    Hi Jeremy, I am trying to navigate the space and trying to understand what fits where.

    Could you shed some light on which parts of bge-m3 ModernBERT would overlap with, or is this comparing apples to oranges?

    https://huggingface.co/BAAI/bge-m3

    • bclavie 2 days ago

      Hey! It’s more like comparing apples to apple pie.

      BGE-M3 is a fine-tuned embedding model. This means that they've taken a base language model, which was trained for just language modeling, then applied further fine-tuning to make it useful for a given application, in this case retrieval.

      ModernBERT is one step earlier in the pipeline: it's the language model that application-specific models such as M3 build on.

  • TheTaytay 3 days ago

    Thank you for this. I can't wait to try this, especially on GLiNER tasks.

  • LunaSea 3 days ago

    Hi Jeremy, do you have plans to adapt this model for different languages?

janalsncm 3 days ago

> encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models

This is partially because people using decoders aren’t using huggingface at all (they would use an API call) but also because encoders are the unsung heroes of most serious ML applications.

If you want to do any ranking, recommendation, RAG, etc., it will probably require an encoder. And typically that meant something in the BERT/RoBERTa/ALBERT family. So this is huge.

  • llm_trw 3 days ago

    Encoders are suffering from the curse of all successful AI applications: they work so they are no longer AI.

    Excited about trying this out, less excited about recalculating a petabyte's worth of embeddings if it's as good as it looks like it will be. At least I can keep my house warm.

    • martin82 2 days ago

      Kinda curious what kind of data you have lying around there, what stack you use to create the embeddings and keep them up to date, and how you use them...

      • llm_trw 2 days ago

        Officially financial data. Unofficially every textbook and science paper ever published. Email me if you're interested.

  • EGreg 3 days ago

    Can you go into detail for those of us who aren't as well versed in the tech?

    What do the encoders do vs the decoders, in this ecosystem? What are some good links to learn about these concepts on a high level? I find most of the writing about different layers and architectures a bit arcane and inscrutable, especially when it comes to Attention and Self-Attention with multiple heads.

    • cubie 3 days ago

      On a very high level, for NLP:

      1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).

      2. a decoder takes an input (e.g. text), and then extends the text.

      (There's also encoder-decoders, but I won't go into those)

      These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.

      In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox, you can shape them into anything, and encoders are the wonderful providers of these embeddings.
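
      If it helps, here's a minimal sketch of that split using the transformers pipelines; the two checkpoint names are just convenient stock choices, not anything from the post:

          from transformers import pipeline

          # encoder: text in, numbers out (one embedding vector per token)
          encoder = pipeline("feature-extraction", model="bert-base-uncased")
          vectors = encoder("BERT turns text into numbers.")

          # decoder: text in, more text out
          decoder = pipeline("text-generation", model="gpt2")
          print(decoder("BERT turns text into", max_new_tokens=10))

      Everything downstream of an encoder (classification heads, similarity search, clustering, NER) builds on those vectors rather than on generated text.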

      • SoothingSorbet 3 days ago

        I still find this explanation confusing because decoder-only transformers still embed the input and you can extract input embeddings from them.

        Is there a difference here other than encoder-only transformers being bidirectional and their primary output (rather than a byproduct) being input embeddings? Is there a reason other than that bidirectionality that we use specific encoder-only embedding models instead of just cutting and pasting a decoder-only model's embedding phase?

        • craigacp 3 days ago

          The encoder's embedding is contextual: it depends on all the tokens. If you pull out the embedding layer from a decoder-only model then that is a fixed embedding where each token's representation doesn't depend on the other tokens in the sequence. The bi-directionality is also important for getting a proper representation of the sequence, though you can train decoder-only models to emit a single embedding vector once they have processed the whole sequence left to right.

          Fundamentally it's basically a difference between bidirectional attention in the encoder and a triangular (or "causal") attention mask in the decoder.
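
          A quick sketch of the contextual-vs-fixed point (bert-base-uncased is just a convenient stand-in here, any encoder checkpoint behaves the same way):

              import torch
              from transformers import AutoTokenizer, AutoModel

              name = "bert-base-uncased"
              tok = AutoTokenizer.from_pretrained(name)
              model = AutoModel.from_pretrained(name)

              def bank_vector(sentence):
                  enc = tok(sentence, return_tensors="pt")
                  with torch.no_grad():
                      hidden = model(**enc).last_hidden_state[0]  # contextual: depends on every token
                  i = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
                  return hidden[i]

              a = bank_vector("I sat by the river bank.")
              b = bank_vector("I deposited money at the bank.")
              print(torch.cosine_similarity(a, b, dim=0))  # well below 1: same token, different context

              # whereas the static embedding table gives one fixed vector per token id,
              # which is all you get by cutting an embedding layer out of a model
              static = model.embeddings.word_embeddings.weight[tok.convert_tokens_to_ids("bank")]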

      • Kinrany 3 days ago

        How much does the choice of the encoder depend on the application?

    • janalsncm 3 days ago

      If you’re interested in learning more, the linked article isn’t a bad place to start.

shahjaidev 19 hours ago

The community would benefit a lot from a multilingual ModernBERT. Pretraining on a multilingual corpus is crucial for a ranking/retrieval model to be deployed in many industry settings. Simply extending the vocab and fine-tuning the en checkpoint won't quite work. Any plans to release a multilingual checkpoint?

mark_l_watson 3 days ago

I saw this early this morning. About four or five years ago I used BERT models for summarization, etc. BERT seemed like a miracle to me back then.

I am going to wait until Ollama has this in their library, even though consuming HF is straightforward.

The speedup is impressive, but then so are the massive speed improvements for LLMs recently.

Apple has supported BERT models in their SDKs for Apple developers for years, it will be interesting to see how quickly they update to this newer tech.

jbellis 3 days ago

Looks great, thanks for training this!

  - Can I fine tune it with SentenceTransformers?
  - I see ColBERT in the benchmarks, is there an answerai-colbert-small-v2 coming soon?

wenc 3 days ago

Can I ask where BERT models are used in production these days?

I was given to understand that they are a better alternative to LLM type models for specific tasks like topic classification because they are trained to discriminate rather than to generate (plus they are bidirectional so they can “understand” context better through lookahead). But LLMs are pretty strong so I wonder if the difference is negligible?

  • vietvu 3 days ago

    LLMs like GPT are heavy and costly (and BERT-family models are LLMs too, with parameter counts up to around 1.5B). For niche problems like classification in a small domain, BERT-like models are much better and cheaper. You don't need all the knowledge a generative LLM has. I have seen many companies using DeBERTa or RoBERTa for text classification rather than GPT/LLaMA.

  • ganeshkrishnan 2 days ago

    LLMs don't have the same use case as encoder-only models. Let's assume you have around a million keywords and you want to find the one most similar to a keyword that the user inputs.

    In pre-processing you would have calculated the vector encodings of all million keywords beforehand; now, given the keyword the user inputs, you calculate its vector and then find the most similar stored vectors.

    LLMs are used by end users; encoders are used by devs inside the app to search/retrieve text.
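
    A minimal sketch of that precompute-then-lookup pattern with Sentence Transformers (the model name and keywords are placeholders of my own):

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder checkpoint

        # offline: embed the whole keyword catalogue once and store the vectors
        keywords = ["running shoes", "trail boots", "wireless earbuds", "noise cancelling headphones"]
        keyword_vecs = model.encode(keywords, convert_to_tensor=True, normalize_embeddings=True)

        # at query time: embed the user's input and return the closest stored keyword
        query_vec = model.encode("bluetooth headphones", convert_to_tensor=True, normalize_embeddings=True)
        best = util.cos_sim(query_vec, keyword_vecs)[0].argmax()
        print(keywords[int(best)])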

dmezzetti 3 days ago

Great news here. It will take some time for this to trickle downstream, but expect to see better vector embedding models, entity extraction and more.

  • cubie 3 days ago

    Spot on

pantsforbirds 3 days ago

Awesome news and something I really want to check out for work. Has anyone seen any RAG evals for ModernBERT yet?

  • cubie 3 days ago

    Not yet - these are base models, or "foundational models". They're great for molding into different use cases via finetuning, better than common models like BERT, RoBERTa, etc. in fact, but like those models, these ModernBERT checkpoints can only do one thing: mask filling.

    For other tasks, such as retrieval, we still need people to finetune them for it. The ModernBERT documentation has some scripts for finetuning with Sentence Transformers and PyLate for retrieval: https://huggingface.co/docs/transformers/main/en/model_doc/m... But people still need to make and release these models. I have high hopes for them.
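
    Concretely, the only thing the raw checkpoint does out of the box is fill in a mask. A quick sketch (I'm assuming the released id is answerdotai/ModernBERT-base and a recent enough transformers version; swap in the actual id if it differs):

        from transformers import pipeline

        fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
        prompt = f"Paris is the {fill.tokenizer.mask_token} of France."
        for candidate in fill(prompt, top_k=3):
            print(candidate["token_str"], round(candidate["score"], 3))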

deepsquirrelnet 3 days ago

I read your paper this morning, and am just thrilled with the work. Love the added local attention layers. I’ve experimented with them for years (lucidrains repo), and was always surprised they didn’t go further. Inference speeds are awesome on this model. Scrapping NSP, awesome. Increased masking, awesome. RoPE and longer context, again, bravo. There’s so many great incremental improvements learned over the years and you guys made so many good decisions here.

I’d love to distill a “ModernTinyBERT”, but it seems a bit more complex with the interleaved layers.

  • anon373839 3 days ago

    > I’d love to distill a “ModernTinyBERT”

    That’s a question I’m interested in as well! DistilBERT and friends have been terribly useful at the edge. I wonder if/when we may see something similar for ModernBERT.

Labo333 3 days ago

Sad that it is English only, not multilingual.

readthenotes1 3 days ago

I guess the next release is going to be postmodern bert.

carschno 3 days ago

The model card says only English, is that correct? Are there any plans to publish a multilingual model or monolingual ones for other languages?

  • amunozo 3 days ago

    Yes, the paper says it's English only.

GaggiX 3 days ago

It would be really cool to have a model like this but multilingual; it would really help with things like moderation.

neodypsis 3 days ago

How does it compare to Jina V3 [0], which also has 8192 context length?

0. https://arxiv.org/abs/2409.10173

  • bclavie 3 days ago

    They perform different roles, so they're not directly comparable.

    Jina V3 is an embedding model, so it's a base model, further fine-tuned specifically for embedding-ish tasks (retrieval, similarity...). This is what we call "downstream" models/applications.

    ModernBERT is a base model & architecture. It's not supposed to be used out of the box, but fine-tuned for other use cases, serving as their backbone. In theory (and, given early signal, most likely in practice too), it'll make for really good downstream embeddings once people build on top of it!

vietvu 3 days ago

So that's what Jeremy Howard was teasing about. Nice one.

crimsoneer 3 days ago

Answer.ai team are DELIVERING today. Well done Jeremy and team!

zelias 3 days ago

missed opportunity to call it ERNIE

  • chriswarbo 3 days ago

    Tangentially:

    ERNIE is probably the most famous "computer" in the UK, which has been picking winners for the UK's premium bonds scheme since the 1950s. It was heavily marketed, to get the public used to the new-fangled idea of electronics, and is sometimes considered one of the first computers; though (a) it was more of a special-purpose random number generator rather than a computer, and (b) it descended from the earlier Colossus code-breaking machines of World War II (though the latter's existence was kept secret for decades). The latest ERNIE is version 5, which uses quantum effects to generate its random numbers (earlier versions used electrical and thermal noise).

    https://en.wikipedia.org/wiki/Premium_Bonds#ERNIE

  • timClicks 3 days ago

    More generally, the prefix "Modern" haunts every product name that uses it. Technologies move fast and modern becomes antiquated very quickly.

    • bclavie 3 days ago

      We had a bit of a discussion around it, but I figured that 6 years warranted the prefix, and it's easier to remember in the sea of new acronyms popping up every day.

      Besides, PostModernBERT will be there for us for the next generational jump.

    • int_19h 3 days ago

      It'll just get shortened to Mobert in the long run anyway.

  • amrrs 3 days ago

    I remember back in the day there was an Ernie model

    • axpy906 3 days ago

      Don’t forget ELMo, the bi-LSTM.

  • behnamoh 3 days ago

    I never liked the names BERT and its derivatives. Of all the names in the world, they chose words that are ugly, specific to one culture, and frankly childish.

    • Cthulhu_ 3 days ago

      Sesame Street has been broadcast in 140 countries; Bert (and Ernie) have been localized to 18 languages, including Arabic, Hindi, Japanese, Hebrew and Chinese, with China having an AI called ERNIE because of course.

      Or to make an overly worded / researched reply to a petulant comment short, they are very much not specific to one culture.

      • Dalewyn 3 days ago

        >they are very much not specific to one culture.

        It's American culture.

        Or as Civilization would put it: We made them buy our blue jeans and listen to our pop music.

        Note: I disagree with all the other points like "ugly" and "childish".

Arcuru 3 days ago

I'm not sure I am understanding where exactly this slots in, but isn't this an embedding model? Shouldn't they be comparing it to a service like Voyage AI?

- https://docs.voyageai.com/docs/embeddings

  • janalsncm 3 days ago

    You’re comparing SaaS to open weights. A SaaS will never compete on the flexibility of adding a classification head to BERT (where the gradients flow all the way back), training it, transferring knowledge to a similar domain, distilling it down, pruning layers, fine-tuning some more, etc., which is a common ML workflow.
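
    For the unfamiliar, a bare-bones sketch of that "gradients flow all the way back" step (bert-base-uncased is just a placeholder; ModernBERT or any other encoder slots in the same way):

        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        name = "bert-base-uncased"  # placeholder encoder checkpoint
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

        batch = tok(["great product", "arrived broken"], padding=True, return_tensors="pt")
        out = model(**batch, labels=torch.tensor([1, 0]))
        out.loss.backward()  # gradients reach the new head and the entire encoder underneath
        # from here: optimizer steps, transfer to a nearby domain, distillation, pruning, more fine-tuning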

  • spott 3 days ago

    Embedding models are frequently based on BERT-style models, but BERT models can be finetuned to do a lot more than just embeddings.

    So an embedding-focused finetune of ModernBERT should be compared to something like Voyage AI, but not ModernBERT itself.

    • KTibow 3 days ago

      What are the people who keep downloading BERT doing then? Are they the minority who directly use it for embeddings?

      • janalsncm 3 days ago

        They are probably fine tuning on their own particular downstream tasks, either for embeddings or as a component of a larger model.

      • spott 3 days ago

        I’m honestly not sure why bert-base-uncased is so popular… the model isn’t that useful on its own. From its Hugging Face page:

        > You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions of a task that interests you.

        > Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2.

        • strangecasts 3 days ago

          I think this comes down to the Hugging Face libraries defaulting to downloading the model from HF if they cannot locate the weights - so "make your own text classifier" tutorial notebooks default to bert-base-uncased as a "standard" pretrained encoder you can put a classification head on top of and finetune, and in turn people run them in Google Colab and just download another copy of the weights on startup, which counts towards the total.

        • metanonsense 3 days ago

          I have been out of the game for a year or so (and was never completely in the game), but back then BERT was the basis for lots of interesting applications. The original Vision Transformer (ViT) was based on (or at least inspired by) BERT; it was used for graph transformers, visual language understanding, etc.