fxtentacle 2 months ago

And for the true "Show HN" experience:

    wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
    wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
    wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
    wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
    wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
    cat tevr_asr_tool-1.0.0-Linux-x86_64.zip.00* > tevr_asr_tool-1.0.0-Linux-x86_64.zip
    unzip tevr_asr_tool-1.0.0-Linux-x86_64.zip
    sudo dpkg -i tevr_asr_tool-1.0.0-Linux-x86_64.deb
    tevr_asr_tool --target_file=test_audio.wav

and then you'll be greeted with some TensorFlow Lite diagnostics, followed by the intermediate states of the beam-search decoder, followed by the hopefully correct transcription result.

And if that piques your curiosity, here's a short overview of the code: https://github.com/DeutscheKI/tevr-asr-tool#how-does-this-wo...

nshm 2 months ago

This model seems strongly overtrained on the CV test set. Usually the improvement from LM rescoring is just 10% relative. In the paper https://arxiv.org/pdf/2206.12693.pdf the improvement is from 10.1% WER to 3.64% WER (Table 6). Such a big improvement suggests that the LM is biased.

Also, the perplexity of the provided n-gram LM on the CV test set is just 86, and most of the 5-gram histories are already in the LM. This also suggests bias.

  • fxtentacle 2 months ago

    Also in Table 6, you see that Facebook's wav2vec 2.0 XLS-R went from 12.06% without LM to 4.38% with 5-gram LM. In comparison to that, I found TEVR going from 10.10% to 3.64% unproblematic. The core assumption of my paper is that for German specifically, the language model is very important due to conserved (and usually mumbled) word endings.

    Anyway, it's roughly a 64% reduction for both wav2vec2 XLS-R and TEVR. So if your criticism that I overtrained the TEVR model turns out to be correct, then that would suggest that the Zimmermeister 2022 wav2vec2 XLS-R was equally over-trained, which would still make it a fair comparison w.r.t. the 16% relative improvement in WER.

    Or are you suggesting that all wav2vec2 -derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.

    Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)

    BTW, regardless of the metrics, this is the model that "works for me" in production.

    BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.

    • nshm 2 months ago

      First of all thank you for your nice research! It is really inspiring.

      > Also in Table 6, you see that Facebook's wav2vec 2.0 XLS-R went from 12.06% without LM to 4.38% with 5-gram LM.

      It is probably Jonatas Grosman's model, not Facebook's. Bias is a common sin for Common Voice trainers, partially because they integrate Gutenberg texts into the LM, and partially because for some languages CV sentences intersect between train and test.

      For comparison, you can check the NVIDIA NeMo model:


      The improvement from LM is from 6.68 to 6.03 as expected.

      > Zimmermeister 2022 wav2vec2 XLS-R was equally over-trained


      > Or are you suggesting that all wav2vec2 -derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.

      Not all the models are overtrained; I mainly complain about the German ones. For example, Spanish is reasonable:


      > Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)

      Perplexity is a measure of LM quality. See here:

      EVALUATION METRICS FOR LANGUAGE MODELS https://www.cs.cmu.edu/~roni/papers/eval-metrics-bntuw-9802....

      Also, for recent perplexities of transformer models, see something like:

      Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context https://aclanthology.org/P19-1285.pdf
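      As a rough illustration of the metric (a toy sketch, not the exact KenLM computation): perplexity is the exponentiated average negative log-probability the LM assigns to held-out text, so the quoted value of 86 means the LM is, on average, about as uncertain as a uniform choice among 86 possible continuations.

```python
import math

def perplexity(probs):
    """Perplexity of a token sequence, given the probability
    some language model assigned to each token: exp of the
    average negative log-probability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns every token probability 1/86 has
# perplexity 86, i.e. it is as uncertain as a uniform
# choice among 86 continuations.
print(perplexity([1 / 86] * 100))  # → 86.0 (up to float rounding)
```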

      > BTW, regardless of the metrics, this is the model that "works for me" in production.

      Sure, but it could work even better if you take more generic model.

      >BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.

      Great idea, we'll get there, thank you!

      • fxtentacle 2 months ago

        I now had time to do some testing, and the CER is already pretty much excellent for TEVR even without the language model, so it appears to me that what the LM mostly does is fix the spelling. In line with that, recognition performance is still good for medical words, but in some cases the LM will actually reduce quality there by "fixing" a brand name into a sequence of regular words.
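        For reference, character error rate is just edit distance divided by reference length; a minimal sketch (assuming plain Levenshtein distance, which is what most ASR toolkits use):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits needed, relative to reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Two character edits (one deletion, one substitution) over
# a 10-character reference:
print(cer("hallo welt", "halo weld"))  # → 0.2
```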

        Thanks for the perplexity paper :) I'll go read that now.

  • rob74 2 months ago

    Sounds interesting, although as someone not that deeply into ML, these terms don't mean a lot to me. What would "bias" mean in this case? That the model would recognize a "standard German" speaker, but not someone from Bavaria? Because that happens to a lot of (non-Bavarian) Germans too.

    • nshm 2 months ago

      The model knows the recognition text well and demonstrates good results because of it. If you test the same model on some unrelated speech which the model hasn't seen yet, the results will not be that great. The error rate might be significantly worse than that of other systems.

      • Evidlo 2 months ago

        Wouldn't that more commonly be called generalizability?

kevin_b_er 2 months ago

So, what caught my interest was this quote at the bottom: "Alternatively, you can donate roughly 2 weeks of A100 GPU credits to me and I'll train a suitable recognition model and upload it to HuggingFace."

This takes 2 weeks of A100 GPU time to train? Current GC spot pricing puts that at about $300.

  • fxtentacle 2 months ago

    "This takes 2 weeks of A100 GPU time to train?"

    Yes, but only for me in this specific case. That's because this German model was derived from a model which was pre-trained for months on hundreds of thousands of hours of YouTube audio in various languages. So if I now train an English recognizer, I don't start from scratch. Instead, I start from a checkpoint which can already almost flawlessly recognize any human utterance. (Or all phonemes in the IPA alphabet, to be more precise.)

    The "training" then only learns the mapping between IPA phonemes and text notation.
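    Very loosely, that kind of fine-tuning can be sketched with a frozen backbone and a small trainable head. This is a NumPy toy on synthetic data standing in for the real wav2vec2 training, purely to show the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are frozen features from a pretrained acoustic
# model: 200 frames, 16 feature dims, 4 output token classes.
features = rng.normal(size=(200, 16))
true_W = rng.normal(size=(16, 4))
labels = np.argmax(features @ true_W, axis=1)  # synthetic targets
onehot = np.eye(4)[labels]

# Only this small output head is trained; the backbone stays frozen.
W = np.zeros((16, 4))
for _ in range(500):
    logits = features @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax
    grad = features.T @ (p - onehot) / len(labels)
    W -= 0.5 * grad                            # gradient descent on cross-entropy

accuracy = np.mean(np.argmax(features @ W, axis=1) == labels)
print(accuracy)  # should be close to 1.0 on this toy data
```

Only W is ever updated; the "pretrained" features never change, which is why the remaining training is comparatively cheap.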

    • mciancia 2 months ago

      What other languages can you "easily" do like that?

      • fxtentacle 2 months ago

        I believe for English, German, Dutch, Spanish, French, Italian, Portuguese, Polish there's enough data in CommonVoice to get good results.

        • gogusrl 2 months ago

          Any chance for Romanian ?

          • fxtentacle 2 months ago

            For Romanian, I believe someone would first have to collect a large dataset of speech recordings together with groundtruth text. Even if you find cheap narrators working for $10 per hour, that's still $100k for 10k hours of data.

captainmuon 2 months ago

This looks pretty cool, especially since it is offline and free! The only caveat is that it is probably not feasible to train yourself, right?

I should play with this if I have some time. I've had the idea for a while to build a voice assistant which can switch modes or datasets while you are speaking. If you say "computer, play ..." for example, it would load a recognizer that is specialized in song names. The idea is that you can mix English song names into a German prompt, and it will not be confused. Every voice assistant I know gets confused, presumably because they convert speech to plain text and only then act on the text.

  • fxtentacle 2 months ago

    > I've had the idea for a while to build a voice assistant which can switch modes or datasets while you are speaking. If you say "computer, play ..." for example it would load a recognizer that is specialized in song names.

    I would say by now the generic recognizers are so good that this is becoming less and less useful. For example, this tool handles non-existing German words quite well.

    That said, the tool has a "--data_folder_path" parameter where you can specify a different acoustic and language model.

    BTW, I also want to build an offline voice assistant :)

    That's how I got started on this journey. You might be interested in my next project, where I try to do offline real-time English recognition with a WebRTC API to make it easy for developers to connect my AI module with their own task logic. Here's the waiting list: https://madmimi.com/signups/f0da3b13840d40ce9e061cafea6280d5...

a2128 2 months ago

This is very impressive work. Could this model be capable of doing streamed input for live speech recognition?

  • fxtentacle 2 months ago

    No. The best you can do with this AI architecture would be something like NVIDIA's NeMo, meaning you group the input audio stream into 2s blocks with 1s overlap and then run the speech recognition on that.
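    That overlapped-block scheme can be sketched as follows (illustrative only; NeMo's actual streaming API differs):

```python
def overlapping_blocks(samples, rate=16000, block_s=2.0, overlap_s=1.0):
    """Yield (start, end) sample ranges: block_s-second blocks
    that overlap by overlap_s seconds, so each speech recognition
    pass sees the tail of the previous block again."""
    block = int(block_s * rate)
    step = block - int(overlap_s * rate)
    for start in range(0, max(len(samples) - block, 0) + step, step):
        yield start, min(start + block, len(samples))

# 5 seconds of audio at 16 kHz → blocks covering 0-2s, 1-3s, 2-4s, 3-5s
blocks = list(overlapping_blocks([0] * (5 * 16000)))
print(blocks)
```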

    If you want real-time speech recognition with less than 0.5s of delay between speaking a word and it being fully recognized, then you need to implement a different architecture. And that one is much more difficult and expensive to train than this one (which was already expensive).

    That said, I want a fully offline and privacy-respecting voice assistant myself.

    So attempting to build the AI for real-time streamed live English speech recognition will be my next project. I plan to ship it as an OpenGL-accelerated binary with a WebRTC server so that others can easily combine my recognition with their logic. But it probably won't be free, since I'm looking at >$100k in compute costs to build it. In any case, here's the waiting list: https://madmimi.com/signups/f0da3b13840d40ce9e061cafea6280d5...

    • phkahler 2 months ago

      >> But it probably won't be free since I'm looking at >$100k in compute costs to build it.

      How about crowd funding it? Your previous work should be enough to convince people it's worth contributing to.

      • fxtentacle 2 months ago

        Yeah, I'm looking into government programs such as the EU "Prototype Fund", too. But the issue with crowdfunding is that if I want to raise $100k on Kickstarter, I need to spend $10k for an agency and another $20k for ads to promote the campaign. So it's quite wasteful (30% just for marketing) unless you already have a large audience willing to pay, which I don't have.

        So I believe my best bet might be to partner up with a larger company who will pay for development and/or just charging users for a license. Nuance's Dragon Home is $200 and their Pro version is $500, so there's a lot of room for me to be cheaper while still reaching $100k in revenue with a realistic number of users.

        • Birch-san 2 months ago

          You might be eligible to use Google's TPU Research Cloud for free, provided you publicize your results? https://sites.research.google/trc/about/

          Otherwise, perhaps you could ask LAION on #compute-allocation?

          • fxtentacle 2 months ago

            Thanks for those excellent actionable suggestions :)

            Comments like these are why "Show HN" can be so rewarding (despite all the pedantry about the submission title which I can't change anymore anyway).

    • a2128 2 months ago

      Thanks for the information! I've been looking to build an accessibility tool for a deaf community so that they can see live captions of conversations, but some of the existing solutions I've tried seem to lag behind in conversational speech accuracy, or they're difficult/impossible to fine-tune with the community-specific words and phrases.

      • fxtentacle 2 months ago

        This type of "Digital Therapeutics" might be paid for by German health insurance companies. For example, https://gaia-group.com/en/ appears to be a successful provider of medical apps.

        If you don't mind, please email me at moin@deutscheki.de and explain in a bit more detail what the needs of that deaf community are. Maybe I can forward that to the right people to get my government to pay for the app that you wish to see developed.

        I mean, I agree with you; it certainly would increase life satisfaction for deaf people if they could "listen in" on conversations to know what others are gossiping about.

fxtentacle 2 months ago

Original poster here, I just wanted to say thank you!

Despite some snark - which I totally deserve, but I can't change the title anymore - I also received some very helpful advice, learned something new, and got introduced to people who plan to use this technology to help others. I can't think of any better outcome for me publishing my research.

Thanks :)

maratc 2 months ago

I'd like to know its accuracy on the sound of Rhabarberbarbara[0] or Nackenzacken[1].



  • marshray 2 months ago

    Bet it'd do just fine - maybe better than humans.

    People just get confused by large numbers of similar syllables because we have to buffer them for more processing. I suspect a pure speech-to-text model doesn't need to worry so much about context and can just take the syllables one by one.

    • fxtentacle 2 months ago

      Actually, this model uses attention layers which are kind of like a query/key=>value look-up and allow the model to merge the knowledge from neighboring time-steps into the current logit prediction.

      The result is that this model performs worse for highly repetitive words, just like humans do.
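      That query/key => value look-up can be sketched as generic scaled dot-product attention (a textbook sketch, not this model's actual code):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query position mixes the
    values of all time-steps, weighted by query/key similarity, which
    is how neighboring time-steps feed into the current prediction."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over time
    return weights @ V

rng = np.random.default_rng(0)
T, d = 6, 8                    # 6 time-steps, 8-dim features
x = rng.normal(size=(T, d))
out = attention(x, x, x)       # self-attention over the time axis
print(out.shape)               # → (6, 8)
```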

      • marshray 2 months ago

        This stuff is so fascinating, I really need to learn more about it. :-)

gok 2 months ago

The "TEVR" token encoding idea is interesting. Any chance you compared it with other subword tokenization techniques like BPE?

  • fxtentacle 2 months ago

    Sorry, not really.

    Training this pipeline was already quite expensive so I compared against all models and papers I could find online, but I couldn't afford to train a full new model just to check wav2vec2 with BPE.

    That said, I did check against exhaustively allowing all 1-4 character tokens, which is pretty similar to BPE, and that performed worse in every situation.
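    For illustration, the "all 1-4 character tokens" baseline can be sketched as a greedy longest-match tokenizer (a hypothetical toy, not the paper's exact setup):

```python
def tokenize(word, max_len=4):
    """Greedy longest-match split into tokens of up to max_len
    characters, given a vocabulary containing every substring of
    the word up to that length (the exhaustive-token baseline)."""
    vocab = {word[i:j]
             for i in range(len(word))
             for j in range(i + 1, min(i + max_len, len(word)) + 1)}
    tokens, i = [], 0
    while i < len(word):
        for l in range(max_len, 0, -1):   # prefer the longest match
            tok = word[i:i + l]
            if tok in vocab:
                tokens.append(tok)
                i += len(tok)
                break
    return tokens

print(tokenize("sprechen"))  # → ['spre', 'chen']
```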

    • gok 2 months ago

      Without even actually retraining everything, I would be curious how different the tokens are with your technique compared to using an off-the-shelf solution like SentencePiece with the same output vocabulary size.

BurningPenguin 2 months ago

How well does it do with accent or slight dialect influences?

  • fxtentacle 2 months ago

    We tested it on CommonVoice German, which is what Mozilla used for DeepSpeech German, too. The idea behind that dataset is that arbitrary people on the internet submit their recordings (hence the "common" in the name) and then if enough other people upvote it as "understandable", it gets included in the dataset.

    As such, the AI works well with a variety of accents.

    • junkerm 2 months ago

      I suspect that "works well" means that the model will output words in "official German" and kind of correct pronunciation errors? I am asking because I had the use case of automatically giving feedback to non-native German speakers.

      • fxtentacle 2 months ago

        There's a command-line parameter "--use_language_model=0" to disable the spellchecker.

fwsgonzo 2 months ago

Is it possible to get speech recognition into a statically linked executable?

ldd myprogram should say "not a dynamic executable"

Are there any libraries for those of us in embedded?

  • fxtentacle 2 months ago

    My code does precisely that, the only dependencies of the finished executable are Debian OS libraries:

    $ lddtree build/tevr_asr_tool
    tevr_asr_tool => build/tevr_asr_tool (interpreter => /lib64/ld-linux-x86-64.so.2)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
        ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2

    • fwsgonzo 2 months ago

      I guess I will give it a try then.

amelius 2 months ago

How many lines for an English version?

And how many lines can be re-used in a version that recognizes both German and English?

tjungblut 2 months ago

I'm sure you can save another line by checking for Mückenstiche in your dictionary.

postalrat 2 months ago

And 27 includes plus a huge model.

"284 lines of C++" is something you could fit on a small microcontroller. This isn't 284 lines of C++.

  • fxtentacle 2 months ago

    In my opinion, the focus here should be on the fact that this is a state-of-the-art AI which beats Facebook's wav2vec2 by a relative 16% improvement, Scribosermo (based on NVIDIA NeMo) by a relative 44%, and Mozilla's DeepSpeech German by a relative 75%. People usually don't share their production-quality tools ;)

    That said, I wrote "284 lines of C++" to indicate that this is compact enough for people to actually read and understand the source code. Also, compiling my implementation is super easy and straightforward ... something which can't be said for Kaldi, Vosk, or DeepSpeech.

    If you try to read the CTC beam search decoder from Mozilla's DeepSpeech [1], that alone is about 2000 LOC in multiple files. If you try to read the pyctcdecode source that is used by HuggingFace [2], that's 1000+ LOC of Python.

    But this implementation covers the entire client side, i.e. the whole "native_client" folder hierarchy in DeepSpeech [3], narrowed down to a mere 284 lines.

    Also, both DeepSpeech and HuggingFace Transformers use TensorFlow as a dependency, i.e. just like me. So in my opinion, it doesn't make sense to include TF in the LOC comparison if all the AI speech recognition systems use it. That would be like including libstdc++, too.

    [1] https://github.com/mozilla/DeepSpeech/tree/master/native_cli...

    [2] https://github.com/kensho-technologies/pyctcdecode

    [3] https://github.com/mozilla/DeepSpeech/tree/master/native_cli...

    • georgia_peach 2 months ago

      "16% better than wav2vec2, 44% better than Scribosermo, 75% better than DeepSpeech" would have been more than enough for a good headline. Of course everyone was going to get sand in their panties over "284 lines of C++", and now it's time to pay the piper of HN pedantry.

      • fxtentacle 2 months ago

        I wanted to specifically highlight that people can (and should) read the source code. I now see that this might have been a mistake. But I was hoping to share the joy of taking a cool tool and looking under the hood.

        • dataflow 2 months ago

          I think you're fine, don't worry about it. It's cool and someone will always complain no matter what you write.

        • xdfgh1112 2 months ago

          You cannot win. The top comments on HN are _always_ pedantry. These people will always find something to complain about or nitpick while completely missing the forest for the trees.

        • jpetso 2 months ago

          I clicked through for the code first, and then got interested in the research. Thanks for doing it the way you did :)

    • kylebgorman 2 months ago

      A speech recognizer is not an "AI".

      • dr_dshiv 2 months ago

        By what definition, principle or authority do you determine what is AI?

        Don't take this the wrong way, but I find that people with more knowledge of the subject tend to be more open about what they include, whereas people with less knowledge tend to do more gatekeeping. AI has a "moving goal post" issue that is notable enough to warrant a wikipedia page: https://en.wikipedia.org/wiki/AI_effect

      • fxtentacle 2 months ago

        Well, the acoustic model is created using deep learning loss-minimization by gradient descent, which is what people usually call AI these days.

woah 2 months ago

284 lines of C++ and how many trained weights?

Touting the "low number of lines" on a neural network project seems kind of silly, since the logic is encoded in the weights. Kind of like if I said "Doom in 5 lines of JS", but those 5 lines just downloaded and ran 130kb of WASM

  • munk-a 2 months ago

    If you want to be really lean: I wrote a program with performance identical to the linked one in a single line of shell code.


    Lines of code began and persists as an absolutely awful way to measure anything.

    • fxtentacle 2 months ago

      I mentioned them to make it obvious that this is A LOT LESS code than Mozilla's DeepSpeech, which uses the same dependencies (plus some).

  • fxtentacle 2 months ago

    The novelty here is the TEVR tokens, which live in the C++ beam search decoder.

    Also, https://news.ycombinator.com/item?id=32411566

    • joe__f 2 months ago

      Just delete all the whitespace and you'll have the whole program in one line of code

      • fxtentacle 2 months ago

        Actually, I used "cloc" to calculate the number and I believe it'll split at semicolons to prevent this.

        But great hack anyway :)

        • joe__f 2 months ago

          Hahaha ok you got me

dahfizz 2 months ago

Guys I just wrote a json parser in one line of code!

    import json
  • petercooper 2 months ago

    And all algorithms are O(f(n)), where f is someone else's responsibility.

henrydark 2 months ago

There are so many comments about the "real" length of the program.

The number 284 means something to people who work in speech recognition - these are the people that know how much _they_ write when they try to compete with this library.

This number isn't meant for people who are disinterested (have no stake) in speech recognition.

  • prewett 2 months ago

    "X in N lines of code" is usually used for compressing an algorithm down to its bare essentials, e.g. "Ray tracing in 100 lines of code". In that case, it's disingenuous to have 100 lines of glue code that calls into POV-Ray. If I compile 284 lines of code, I expect a few kB executable (including libraries that aren't language libraries like libc).

    This hasn't really simplified the speech-recognition algorithm: it still requires large amounts of training data and time, and it still does the speech recognition by solving a large matrix with TensorFlow. It's not interesting to me that you can hook up TensorFlow in a few lines of code; I know TensorFlow can solve the problem. However, I would be interested in 284 lines of code that replace TensorFlow, which is what the title suggested to me. A better title would focus on how the model and/or the handling of the data produces better results, because that's what this code seems to be about.

    • fxtentacle 2 months ago

      If you compare this with Mozilla's DeepSpeech repository, you will find that correctly calling TensorFlow Lite and handling the results in only 284 lines of code is impressive, too.

      Also, the C++ code is mostly a custom beam search decoder based on the research for my paper, so it's not like TensorFlow is doing all the heavy lifting here, because precisely that TEVR token decoder causes the relative 16% performance improvement of this speech recognition AI over others.

codeflo 2 months ago

I don’t want to take anything away from the achievement, because this looks very useful, but the headline misled me a bit. I’m not sure whether it’s my affinity for code golf, but when I see someone bragging about line count, I don’t expect the use of multi-million-line nonstandard libraries. :) For anyone wondering, most of the 284 lines are (of course, some might say) calls into TensorFlow. Still, I think this is really nice, just not what I expected from the title.

  • fxtentacle 2 months ago

    Actually, more than half the C++ code is the TEVR token decoder, which is based on new research and specific to this project.

    As for the LOC count and excluding TensorFlow, I tried to explain my rationale here: https://news.ycombinator.com/item?id=32411566

  • thaumasiotes 2 months ago

    > I’m not sure whether it’s my affinity for code golf, but when I see someone bragging about about line count, I don’t expect the use of multi million line nonstandard libraries.

    I wouldn't say the headline is "misleading" - no one is about to be fooled into thinking 300 lines of C++ could be capable of state-of-the-art speech recognition. The headline is squarely in the territory of "complete nonsense, but you can tell that without having to read further than the headline".

    • codeflo 2 months ago

      I’m not sure what gives you the confidence to make absolute statements like this. It might be unlikely, but code golfers, demo sceners and the like regularly do crazy stuff with ridiculously little code.

    • eurasiantiger 2 months ago

      Are you saying state-of-the-art speech recognition cannot fit in 284 lines of C++ without any third party libraries?

      • thaumasiotes 2 months ago

        Yes, assuming the lines are of reasonable length.

        • eurasiantiger 2 months ago

          I suppose we have a challenge in our hands.

          • fxtentacle 2 months ago

            As soon as you succeed, I'm pretty sure someone will complain that the sequence of matrix multiplications in the AI parameter file also counts as "code" in the wider sense.

            • thaumasiotes 2 months ago

              Feel free to consider the parameter file "output" that doesn't count against lines of code.

              Anything you used in the process of generating it will be input that does count.

              • fxtentacle 2 months ago

                So 10000+ hours of WAV files...

                • eurasiantiger 2 months ago

                  Download a large amount of random German language videos off YouTube, but only ones with handmade subtitles. Correlate audio with text. Record audio, transform to text.

                  I posit this can be done in less than 284 lines of C++ while having an error rate equal to or better than the state-of-the-art for everyday speech.

                  Gentlemen, ready your putters…

                  • thaumasiotes 2 months ago

                    > Anything you used in the process of generating it will be input that does count.

lynndotpy 2 months ago

To those criticizing the title, how would you improve it? Dependencies do the heavy lifting, but that's true even for a hello-world. It seems obvious enough that I wouldn't see a reason to clarify it.

Instead, to me, this reads, "Hey C++ fans, you don't need Python for nice things. Look at what you can do in 284 lines of code!"

I don't have C++ chops, so this is nontrivial for me. I appreciate OP sharing this!

  • UncleEntity 2 months ago

    > To those criticizing the title, how would you improve it?

    By not making false assertions?

    Judging by the comments, the 284 lines are in addition to many hours of GPU training time plus some magic and a huge library. I didn’t even click the link because I knew it was 284 lines of plumbing code on top of something else.

    I once wrote an entire RenderMan-compatible renderer in a couple hundred lines of Python. It was really a super dumb script that generated ctypes bindings from the header of an actual renderer (Pixie, if anyone is curious).

    • lynndotpy 2 months ago

      I really don't think it's a false assertion. What code isn't "plumbing"?

      For example, you can get 90% of the way to a CSV parser in Python in one line (as `[line.split(",") for line in open("some.csv").readlines()]`). Should we consider it a false assertion to call that a "one liner" and if so, how should it properly be described?

      • UncleEntity 2 months ago

        Edit: and I misread CSV as SVG, it seems.

        I actually wrote a generator for an SVG parser/writer, and not counting the XML library it came in at close to 40k lines of code. Or maybe 70k lines; I don’t recall, I just know it is a lot. If I split it up into one class per file, it took something like 45 minutes to compile (it is a C-API extension), so I would dump all the code into a single file for faster compiles.

        Haven’t looked at it in a while, but the generator file is probably around 400-something lines. I’m certainly not going to claim it’s a validating SVG library in 400-odd lines of code.

        I think I had some cockamamie scheme to make an SVG to Grease Pencil converter for Blender and only got around to shaving that one yak.

    • fxtentacle 2 months ago

      The core difference in my opinion is that the 284 lines of code here effect a relative 16% improvement in result quality over what until now was the best publicly available research.

      That's why I wrote "State-of-the-Art" at the beginning of the title. Because this is based on new research and it works better than previous research.

      • UncleEntity 2 months ago

        I don’t doubt you put blood, sweat and tears into this just that the title is misleading.

  • itcrowd 2 months ago

    Show HN: State-of-the-Art German Speech Recognition in C++

    • lynndotpy 2 months ago

      That makes sense, but knowing it's 284 LOC means I can casually read it and learn something from that.

      That would be different at 2840 LOC or very very different at 284000 LOC.

      • fxtentacle 2 months ago

        Yes, this is precisely why I included it. I wanted to highlight for people who have experience with speech recognition in general that this is a magnitude easier to read than Mozilla's DeepSpeech.

  • glouwbug 2 months ago

    Sure, but even something like `int main() {}` relies on an operating system, the C runtime, a standard library, and hardware to boot. Best build the universe to make an apple pie

  • rob74 2 months ago

    Maybe "Show HN: State-of-the-Art German Speech Recognition in 284 lines of C++ glue logic" ?

    • fxtentacle 2 months ago

      That would leave out the fact that those 284 lines contain the new decoder based on our paper which leads to the relative 16% reduction in word error rate.

      Also, it's 3 characters too long to be a valid HN title.

      And lastly, this does contain the parameters for a new AI model which is based on my research, so it's not all "glue logic" ;)

OrangeMonkey 2 months ago

I can do it in 1 line of code if it includes downloading your github repo.

  • mrweasel 2 months ago

    Using operating system provided functionality, a GUI library or anything in libc, isn't something I'd hold against anyone when counting lines of code. Relying on an entire external project however, that's a bit misleading.

    • Someone 2 months ago

      > Using operating system provided functionality

      I’m inclined to accept that, too, but it is a moving goalpost and unfair when comparing between OSes and/or time periods. For example, consumer OSes have been shipping with speech recognition libraries for decades, and they’re getting better and better.

    • hardware2win 2 months ago

      Where is the boundary then?

      You cannot "outsource" the "core"?

      • mrweasel 2 months ago

        I don't know. It's not that it isn't impressive, and it is very illustrative of how much you can build by utilizing the tools that are available to all of us. It's just: at what point are you writing code, and at what point are you just initializing and configuring an existing tool?

        It's also how you present it. If you said: "Building a TensorFlow backed German speech recognition system in only a few hundred lines of code." Then I feel like you're being more honest.

        • fxtentacle 2 months ago

          I tried to include TensorFlow in the submission headline, but that would have been too long. "Show HN: State-of-the-Art German Speech Recognition in 284 lines of C++" uses up 72 of the 80 character limit. Originally I also wanted to mention that this is offline, cloud-free, and privacy-respecting, but that, too, didn't fit.

          • andsoitis 2 months ago

            I can appreciate your dilemma, but I think the submission would have been better served if the title emphasizes the real win, which is improved accuracy over some alternatives (like you mention elsewhere in the comment thread), rather than lines of code (because that invites scrutiny on the wrong, non-salient dimension).

      • dahfizz 2 months ago

        You can do whatever you want. Just don't brag about your line count if all your project does is import dependencies.

        • fxtentacle 2 months ago

          It doesn't. It implements a new way of decoding the logits which improves performance by a relative 16% over the previously best German speech recognition, which was Facebook's wav2vec2.

          And the size is relevant to people in the industry because DeepSpeech uses 2,000+ LOC to implement its decoding, so this works better with 10x less code.

          • agileAlligator 2 months ago

            > It implements a new way of decoding the logits which improves performance by a relative 16%

            Then you should have mentioned that in the title rather than the LOCs

pkaye 2 months ago

Includes a standard C++ library called tensorflow...

  • fxtentacle 2 months ago

    ... like all the other speech recognition projects that you might want to compare the LOC against.

    • numpad 2 months ago

      ...which however don't clickbait by using LoC as a primary selling point.

      • fxtentacle 2 months ago

        I wouldn't call it clickbait.

        Mozilla's DeepSpeech is so large that you can't really read it and understand it. This one is 10x less code while recognition quality is 75% better (lower relative word error rate).

        So this one is small enough that you can read the source code if you want to, while DeepSpeech is not.

        • numpad 2 months ago

          Good point – calling this clickbait might be too cynical.

          From this perspective I can definitely get behind advertising the project with the LoC measurement. Subjectively, I still find it to be a bit of "not telling the whole truth"; however, I've also only ever toyed around with speech recognition AI.

    • Evidlo 2 months ago

      You could have maybe said "284 lines of code plus weights" to avoid all this criticism that misses the point of the project.

      • fxtentacle 2 months ago

        That wouldn't have fit into the 80-character limit. I also wanted to mention that it's private and offline, but couldn't phrase it compactly enough.

cgeier 2 months ago

I was quite interested in this, but after seeing that it includes three big other repositories, my interest was immediately gone.

I'm not saying this isn't an interesting project, but perhaps lead with something different?

  • eternalban 2 months ago

    How to assemble a German speech recognition program in under 300 lines.


    ps: I skimmed the paper cited and what I wrote above is -not- correct. The project is not simply assembling a pipeline, it is claiming ('papers with code') an innovation:

    "This paper presents TEVR, a speech recognition model designed to minimize the variation in token entropy w.r.t. to the language model. This takes advantage of the fact that if the language model will reliably and accurately predict a token anyway, then the acoustic model doesn't need to be accurate in recognizing it."


    "We have shown that when combined with an appropriately tuned language model, the TEVR-enhanced model outperforms the best German automated speech recognition result from literature by a relative 44.85% reduction in word error rate. It also outperforms the best self-reported community model by a relative 16.89% reduction in word error rate."
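
    As a quick sanity check on the quoted 16.89% figure, the relative reduction follows directly from the WER numbers in the paper's Table 6 (4.38% for wav2vec2 XLS-R with LM, 3.64% for TEVR, as cited elsewhere in this thread):

    ```cpp
    #include <cassert>
    #include <cmath>

    // Relative WER reduction: (baseline - improved) / baseline.
    double relative_wer_reduction(double baseline_wer, double improved_wer) {
        return (baseline_wer - improved_wer) / baseline_wer;
    }
    ```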

    • fxtentacle 2 months ago

      Yes, we actually did research and spent a significant amount of GPU time. Thanks to my luck of stumbling into the right people at the right time, I could afford cloud-scale training because OVH granted me very generous rebates...

      The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:

      We don't train what you don't need to hear

      If you want to play around with the TEVR tokenizer design, here's the source for that: https://huggingface.co/fxtentacle/tevr-token-entropy-predict...
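
      The idea above can be sketched roughly as follows. This is a hypothetical illustration only: the function name, the normalization, and the per-token weighting scheme are my assumptions, not the actual TEVR formula (see the paper and the linked tokenizer source for the real scheme). Tokens the language model already predicts with low entropy contribute less to the acoustic model's training loss:

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <vector>

      // Hypothetical sketch: weight each token's acoustic negative
      // log-likelihood by its (normalized) language-model entropy, so
      // tokens the LM predicts reliably are de-emphasized in training.
      double weighted_loss(const std::vector<double>& acoustic_nll,
                           const std::vector<double>& lm_entropy) {
          double total = 0.0;
          for (double e : lm_entropy) total += e;
          double loss = 0.0;
          for (size_t i = 0; i < acoustic_nll.size(); ++i) {
              // Normalize entropies so the weights average to 1.
              double w = lm_entropy[i] * lm_entropy.size() / total;
              loss += w * acoustic_nll[i];
          }
          return loss / acoustic_nll.size();
      }
      ```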

      • tialaramex 2 months ago

        So does this mean the recogniser would be worse at recognising unexpected utterances, which is roughly what you'd see with human recognition?

        What's the German equivalent of "How to Wreck a Nice Beach"?

        • fxtentacle 2 months ago

          Yes and no. With perfect audio quality, it'll write down almost verbatim what you said. But as the audio gets more noisy, it'll shift more and more towards the most likely interpretation.

        • yorwba 2 months ago

          Eishockey, Kanufahren, Wirsing.

          (Misheard versions of: Alles okay. Kann noch fahren. Wiedersehen! Roughly: "All okay. Can still drive. Goodbye!")

      • cgeier 2 months ago

        > The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:

        > We don't train what you don't need to hear

        This does sound a lot more interesting than the ~280 lines of code.

        • fxtentacle 2 months ago

          For a researcher, yes. But for understanding the trick there, you need to have read and understood the CTC loss paper.

          For people like my industry clients, on the other hand, "code that is easy to audit and easy to install" is a core feature. They don't care about the research, they just want to make audio files searchable.

olliej 2 months ago

284 loc + however many LoC in tensor flow + however many training weights.

I think "X in Y LoC" should be limited to where Y is the LoC to do the work, not the LoC to setup/interface with some other library. We're getting ever closer to "SQL DB in only 200 LoC" that simply forwards to sqlite or some such.

  • fxtentacle 2 months ago

    The unique work that makes this speech recognition superior to other tools is in those 284 lines of code: https://github.com/DeutscheKI/tevr-asr-tool/blob/master/tevr...

    That's a custom-designed beam search decoder implemented in C++ and based on the research for my TEVR paper. It increases performance by a relative 16% reduction in word error rate.
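
    For intuition, the scoring at the heart of a beam search decoder with a language model can be sketched as shallow fusion: candidates are ranked by acoustic log-probability plus a weighted LM log-probability. The struct, the weight alpha, and the pruning below are illustrative assumptions, not the project's actual implementation:

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    struct Hypothesis {
        std::string text;
        double acoustic_logp;  // score from the acoustic model
        double lm_logp;        // score from the n-gram language model
    };

    // Keep the `width` best hypotheses under the fused score
    // acoustic_logp + alpha * lm_logp (higher is better).
    std::vector<Hypothesis> prune_beam(std::vector<Hypothesis> beam,
                                       double alpha, size_t width) {
        std::sort(beam.begin(), beam.end(),
                  [alpha](const Hypothesis& a, const Hypothesis& b) {
                      return a.acoustic_logp + alpha * a.lm_logp >
                             b.acoustic_logp + alpha * b.lm_logp;
                  });
        if (beam.size() > width) beam.resize(width);
        return beam;
    }
    ```

    With noisy audio the acoustic scores flatten out, so the LM term dominates and the decoder drifts toward the most plausible sentence, which matches the behavior described in the replies above.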

einpoklum 2 months ago

I think the author should probably use some C++ <algorithm> calls at lines 241 and later, avoiding raw loops and making the code clearer.

Also, yeah, it's not 284 lines of C++ by any stretch of the imagination, but the main file is pretty readable, which is nice.

  • fxtentacle 2 months ago

    I just ran cloc on Linux over the C++ files that I wrote.

    Yes, I didn't count my 3 dependencies KenLM, Wave, or TensorFlow, because those are used by pretty much all speech recognition projects. For comparing the complexity of my code to Mozilla's DeepSpeech, it makes sense to ignore the LOCs for shared dependencies.

    • einpoklum 2 months ago

      It was a cute gimmick for the title, nm. But you should try to avoid raw loops in favor of established iteration patterns / standard library algorithms. I found this talk on the subject by Sean Parent to be educational:


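      To make the suggestion concrete, here is a sketch of what that refactoring looks like. The softmax-style normalization is a made-up stand-in for the kind of raw loop in question, not the project's actual code:

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cmath>
      #include <numeric>
      #include <vector>

      // Softmax over a vector of logits, using standard algorithms
      // instead of hand-written index loops.
      std::vector<double> normalize(const std::vector<double>& logits) {
          std::vector<double> out(logits.size());
          // Raw-loop equivalent: for (i...) out[i] = exp(logits[i]);
          std::transform(logits.begin(), logits.end(), out.begin(),
                         [](double x) { return std::exp(x); });
          double sum = std::accumulate(out.begin(), out.end(), 0.0);
          std::transform(out.begin(), out.end(), out.begin(),
                         [sum](double x) { return x / sum; });
          return out;
      }
      ```
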
      • fxtentacle 2 months ago

        Thanks for sharing the talk :)

vintermann 2 months ago

284 lines of C++, and a 1.53 GB wav2vec2 model. But still very nice!

  • fxtentacle 2 months ago

    Actually the model is TEVR, which keeps the wav2vec2 feature extraction but uses a modified encoder that improves performance by exploiting redundancies in the German language.

nottorp 2 months ago

And if you add the dependent libraries it's 3 million lines?

ale42 2 months ago

284 lines of C++, and a big external dependency (tensorflow)... plus the model (but this was expected).

nice_byte 2 months ago

> 284 lines of c++

> includes the entirety of tensorflow as a submodule

CamperBob2 2 months ago

LOL, 284 lines of C++, plus:

   #include "absl/flags/parse.h"
   #include "absl/flags/flag.h"
   #include "absl/flags/usage.h"
   #include "absl/flags/internal/commandlineflag.h"
   #include "absl/flags/internal/private_handle_accessor.h"
   #include "absl/flags/reflection.h"
   #include "absl/flags/usage_config.h"
   #include "absl/memory/memory.h"
   #include "absl/strings/match.h"
   #include "absl/strings/str_cat.h"
   #include "absl/strings/string_view.h"
   #include "tensorflow/lite/c/common.h"
   #include "tensorflow/lite/delegates/hexagon/hexagon_delegate.h"
   #include "tensorflow/lite/interpreter.h"
   #include "tensorflow/lite/interpreter_builder.h"
   #include "tensorflow/lite/kernels/kernel_util.h"
   #include "tensorflow/lite/kernels/register.h"
   #include "tensorflow/lite/model_builder.h"
   #include "tensorflow/lite/testing/util.h"
   #include "tensorflow/lite/tools/benchmark/benchmark_utils.h"
   #include "tensorflow/lite/interpreter.h"
   #include "tensorflow/lite/kernels/register.h"
   #include "tensorflow/lite/model.h"
   #include "tensorflow/lite/optional_debug_tools.h"
   #include "kenlm/lm/ngram_query.hh"
   #include "wave/file.h"
   #include "tensorflow/lite/minimal_logging.h"