For those too impatient to read the details, check out the "Hear for yourself" examples toward the bottom of the page. They're reproducing decent sounding speech at 1.6 kbps.
1.6 kbps is nuts! I like to re-encode audio books or podcasts in Opus at 32 kbps and I consider that stingy. The fact that speech is even comprehensible at 1.6 kbps is impressive. As the article explains, their technique is analogous to speech-to-text, then text-to-speech.
The original recordings are a little stiff, and the encoded speech is a little more stiff. It isn't perfect, but it's decent. It'll be interesting to hear this technique applied to normal conversation. If regular speech holds up as well as their samples, it should be perfectly adequate for conversational speech. At 1.6 kbps, which is absurd.
Also, I wonder how well this technique could be applied to music. My guess is that it won't do justice to great musicians ... but it might be good enough for simple pop tunes.
I’ve done experiments with Opus that produce intelligible (but ugly) speech at 2.3 kbit/s. It involves downsampling the Opus stream at the packet level—e.g. transmit only one out of every three packets. It was surprisingly easy. Nothing as sophisticated as what’s going on here.
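That packet-level decimation can be sketched in a few lines. This is an illustration of the idea, not the actual experiment's code: keep one of every N packets and rely on the decoder's packet-loss concealment (PLC) to fill the gaps. The function name is mine.

```python
# Sketch of packet-level downsampling of an Opus stream: transmit only one
# out of every N packets and let the decoder's loss concealment (PLC)
# interpolate the rest. Names are illustrative, not a real Opus API.
def decimate_packets(packets, keep_every=3):
    """Keep one out of every `keep_every` packets, starting with the first."""
    return [p for i, p in enumerate(packets) if i % keep_every == 0]
```

Because Opus packets are self-contained, dropping them at the transport level needs no re-encoding, which is what makes this trick so easy.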
Also based on work by Xiph. Possibly using the same LPC used here.
Actually, this won't work at all for music because it makes fundamental assumptions that the signal is speech. For normal conversations, it should work, though for now the models are not yet as robust as I'd like (in case of noise and reverberation). That's next on the list of things to improve.
I tried it with music and the results were spooky. Very ethereal and ghostly. It was only with some classical music though, I might have to do a pop song next and share the results!
Here we go! This is the first minute or so of Penny Lane by The Beatles converted down to a 10KB .bin and then back to a .wav: http://no.gd/pennylane.wav .. unsurprisingly the vocals remain recognizable, but the music barely at all.
As imagined by Marilyn Manson...
Pretty much! It does show off how the codec works, though: it seems to misinterpret parts of the music as the pitch of the speech, so Paul's voice sounds weird at the start of most lines but okay through the rest of each line.
I've also run a BBC news report through the program with better results although it demonstrates that any background noise at all can throw things off significantly: https://twitter.com/peterc/status/1111736029558517760 .. so at this low bitrate, it really is only good for plain speech without any other noise.
Well, in the case of music, what happens is that due to the low bit-rate there are many different signals that can produce the same features. The LPCNet model is trained to reproduce whichever of these is most likely to be a single person speaking. The more advanced the model, the more speech-like the music is likely to turn out.
When it comes to noisy speech, it should be possible to improve things by actually training on noisy speech (the current model is trained only on clean speech). Stay tuned :-)
Can you try it with Tom's Diner by Suzanne Vega? It's sung without any instruments, and an early version of MP3 reportedly was a disaster on that song.
Here you go: http://no.gd/vega2.wav
It holds up ridiculously well considering the entire song compresses down to 25392 bytes.
The lyrics of the song are 1200 characters long, so this version of the song only takes up twenty times more space than the written lyrics.
At some point she sings "loose" instead of "nice" in the compressed version, and a bit later it also sounds like "lulk" instead of "milk". So it's a bit lossy even with respect to the lyrics!
I'm getting a 404 on this
Curious, it definitely works, but the domain is "weird" enough that certain firewalls or proxies may have trouble, perhaps. I've put it at https://gofile.io/?c=F5gle3 as an alternative.
That one works! And now I'm going to have nightmares.
Compare it with this now: https://youtu.be/lHjn8ffnEKU :-)
Could you also try "I Feel Love" by Donna Summer?
I am curious how it sounds when there is a really active bassline and lead synth.
http://no.gd/donna2.mp3
The vocal sections just sound like someone clearing their throat out.
I suspect the reason that excerpt sounds so bad is that the music has several instruments playing at once. One doesn't generally design a vocoder to deal with more than one voice. As the excerpt plays, you can hear that the most prominent instruments (e.g. the bass at several moments) sound pleasing, albeit speech-like.
It would probably be different from the original music, but pleasant, if one processed each track separately.
Sounds like a typical LPC encoder at a low bitrate, like maybe 5 kbps.
Hilariously nightmarish. I'm going to use this for my alarm clock...
Right. This form of compression assumes a primary single pitch, plus variations from that tone. You can hear it locking into different components of the song and losing almost everything else.
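For readers unfamiliar with it, the linear prediction (LPC) analysis these vocoders build on fits an all-pole model to each frame: predict each sample as a weighted sum of the previous ones. A toy sketch (solving the autocorrelation normal equations directly; real codecs use Levinson-Durbin and add pitch/excitation modeling on top):

```python
import numpy as np

def lpc(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """Toy LPC: coefficients a with x[n] ~ sum_k a[k] * x[n-1-k]."""
    # Autocorrelation at lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz normal equations R a = r[1..order]; a dense solve is fine
    # for illustration, though production code uses Levinson-Durbin.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])
```

The coefficients capture the spectral envelope; the "primary single pitch" assumption lives in the excitation signal that drives this filter, which is why polyphonic music confuses it.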
Heavy compression of voice is vulnerable to background noise.
I miss the classic telco 8K samples per second, 8 bits. We used to think that was crappy audio.
For comparison, adaptive GSM encodings, which are in use for cellphones today, are also in the single-digit kbps.
https://en.wikipedia.org/wiki/Adaptive_Multi-Rate_audio_code...
I use --vbr --bitrate 16 and it feels indistinguishable from the original for podcasts. As opusenc takes only wav for input and does not use multiple cores, I had to write scripts for parallel re-encoding of stuff.
I like to use Makefiles for parallel encoding.
Make -j4 or whatever. There are a few other ways to do this (e.g. xargs).
I've been waiting for someone to do this* with audio and/or video. Amazing work.
Also worth reading this related link: https://www.rowetel.com/?p=6639
There's some really exciting progress that could be made in this space. The page mentions that they could use it as a post-processing filter for Opus for better decoding to avoid changing the bitstream. It could also be useful as a way to accommodate for packet loss and recover "just enough" to avoid interrupting the conversation.
* encoding audio through a neural network for network transmission
For comparison, your standard police/fire/medical digital radio in the US sends voice at 4.4 kb/s. So this is approximately a third of that.
So maybe this line of work will mean more spectrum available in the future.
Only if Motorola gets out of the way and supports modern codecs and standards. Current public safety radio networks are using ancient TDMA tech that Motorola has cobbled together, along with audio codecs that shred voice quality. The only good part is the durability of the pricey radio, some are even intrinsically safe.
Public safety digital radio networks are primarily APCO Project 25 (P25), which uses IMBE/AMBE developed by DVSI. Motorola's original digital radios used a proprietary vocoder called VSELP (also used by iDEN/Nextel). When APCO standardized public safety digital radios, they rejected VSELP and chose IMBE from DVSI instead. Personally I think VSELP sounds better than IMBE, and I'm not sure whether IMBE was chosen for technical superiority or for political reasons (i.e. picking a non-Motorola solution due to Motorola's dominance). Also, APCO Project 25 Phase 1 was not TDMA; however, Phase 2 is.
https://en.wikipedia.org/wiki/Project_25
Public safety radio is a truly mission-critical service that moves slowly - equipment lasts years or decades, is expensive, and is not frequently replaced or upgraded, hence new technology adoption is slow. Vocoder choice is driven by a standards committee for interoperability (which has seen more emphasis since 9/11), and of course committees aren't typically known for working fast. Public safety radio is definitely not a place for a "move fast and break things" mentality.
and uses codecs with a computational cost of at most a couple hundred MIPS (AMBE/AMBE+2), not GFLOPS
Just to put this in perspective, a traditional digital phone line carries 64 kb/s (8,000 samples per second at 8 bits each), which was long considered the channel size needed to send the human voice with reasonable quality. They are doing it in 1.6 kb/s!
Aren't "traditional" aka POTS lines analog, and therefore not doing any encoding whatsoever?
There are band filters and such on legacy, fully analog systems.
G.711 (which is standard now for non-cellular call audio) is a step down, but Opus at 16Kbps sounds better to me than a classic, full analog system due to the lack of band cutoff & smarter encoding.
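G.711 fits 8,000 samples/s into 8 bits each (64 kb/s) by companding: compressing amplitude logarithmically before quantizing, so quiet speech gets finer resolution. A sketch of the mu-law curve it's based on (the continuous formula, not G.711's exact segmented lookup tables):

```python
# Sketch of mu-law companding (mu = 255), the idea behind G.711's 8-bit
# samples: compress amplitude logarithmically before quantizing, then
# expand on decode. Continuous formula, not the standard's segment tables.
import math

MU = 255.0

def mulaw_encode(x: float) -> float:
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_decode(y: float) -> float:
    """Invert mulaw_encode."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

Quantizing the companded value to 8 bits is what yields the 64 kb/s channel being compared against here.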
For those interested in low bandwidth audio codecs, take a look at the voice codec used for Iridium handheld satellite phones, which was finalized in about 1998. Fully twenty plus years ago.
It doesn't sound the best, but consider the processing power constraints it was designed with...
https://en.wikipedia.org/wiki/Iridium_Communications#Voice_a...
Iridium appears to be using a vocoder called AMBE. Its quality is similar to that of the MELP codec from the demo, and it also runs at 2.4 kb/s. LPCNet at 1.6 kb/s is a significant improvement over that -- if you can afford the complexity, of course (at least it'll work on a phone now).
Based on my previous experience with Iridium I believe it actually operates at a data rate up to about 3 to 3.2 kb/s. 2400 bps of it is actual usable voice payload, the remaining 600 bps is FEC.
Iridium data (not the new next-generation network) service is around the same speed, it's 2400 bps + whatever compression v42bis can gain you. For plain text and stuff it can be a bit faster, something that's already incompressible by v42bis will be right around 2400 baud.
The examples sound excellent -- equal to or better than any text-to-speech synthesizer I've ever heard. I would love to start using it for audio books and for VoIP to save space and traffic as soon as possible. And a Linux-native text-to-speech synthesizer capable of producing speech of this quality is something I dream of (for now, the only option I know is booting into Windows and using Ivona voices).
Mozilla DeepSpeech (speech to text) and Mozilla TTS are both useful at this point: https://research.mozilla.org/machine-learning/
This is really amazing work. Nice to see LPC pushed to its limits, and I can't wait to see what's next for speech compression. Here's hoping the mobile companies pick up on something similar soon.
Voice quality at that bit rate is absolutely astounding to me. This is one of my favorite things – to see such elegantly applied research.
It would be really cool if this made it into FreeDV so the HF bands could benefit from it!
My dream is for telecom monopolies to get broken up by technologies like this.
might be fun to try to port this to the dadamachines doppler... :)
OK, so... Would it be possible to do something similar for video?
Similar as in the same approach, or as in "apply neural networks to all the things"? Because if it's the former, this approach was very specifically tailored to human speech, taking into account how much it can compress/interpolate qualities like pitch and the spectral envelope. That's far too specific to apply to video.
As for the latter, you'd have to perhaps feed Google Scholar the right incantations or ask someone with knowledge. As far as I know, video codecs already have a huge bag of tricks they use (for example the B-frames borrowed in this post). Even then, the key points in this codec were that firstly it's meant for use at very low bitrates, where existing codecs break down, and then secondly it's a vocoder, so it's converting audio to an intermediate form and resynthesizing it. That kind of lossiness is acceptable for audio, but I'm not sure how it would work acceptably for video.
I should have been more specific. I meant that instead of compressing video to minimise pixel difference, minimise feature difference instead.
Really thought this was gonna be human-neural-activity-to-speech and I feel like a doofus.
3 GFLOPS: we are deep into diminishing-returns territory here. Opus seems good enough.
3 GFLOP/sec sounds like a lot, but it's considerably less math than the radio DSPs inside any modern phone's baseband are doing during a phone call.
I don't know much about phone tech, are the basebands really doing math or just instrumenting? My assumption would be that there is just some sensor writing to a buffer at a high frequency but that whatever processes that buffer operates at a lower frequency.
Your question is hard to parse -- what do you mean by "instrumenting"? If it helps, though: the word "baseband" itself refers to the low-frequency band containing just the bandwidth of the signal, i.e. that is the lower frequency...
True but here I intended "baseband" to mean https://en.wikipedia.org/wiki/Baseband_processor
I wonder if a particular network can be implemented more economically if we only run it in "transformation" mode, without the need to train it.
Keep in mind that the very first CELP speech codec (in 1984) used to take 90 seconds to encode just 1 second of speech... on a Cray supercomputer. Ten years later, people had that running in their cell phones. It's not just that hardware keeps getting faster, but algorithms are also getting more efficient. LPCNet is already 1/100 the complexity of the original WaveNet (which is just 2 years old) and I'm pretty sure it's still far from optimal.
This is roughly >100x the computation for a 2x improvement, which might sound great, except we are already talking single-digit kbit/s here, hence diminishing returns.
Opus is awesome and covers a previously unmatched spectrum of use cases... but that isn't everything.
Opus isn't good enough to be a replacement for AMBE for use over radio. Opus doesn't make it easier to make very high quality speech synthesis, etc.
Opus loss robustness could be much better using tools from this toolbox-- and we're a long way from not wanting better performance in the face of packet loss.
Opus is still improving: v1.1 to v1.2 and then on to v1.3 (current in FFmpeg) saw huge reductions in encoding compute, and the minimum bitrate for stereo wideband keeps falling year after year.
The limiting factor for Opus's penetration has been compute: FEC is still rarely supported on VoIP deskphones because of this, and the same goes for handling multiple Opus calls at once.
This is roughly a 2x improvement over AMBE+2, except AMBE peaks at maybe a couple hundred MIPS, and there are better, less computationally intensive alternatives, like TWELP at 20-70 MIPS: https://dspini.com/vocoders/lowrate/twelp-lowrate/twelp2400
Not really, no. Especially not if this is implemented in a specialized accelerator. A GFLOP is not that much there. Also, like most other neural network algorithms, this could also be done in fixed point, thereby further reducing the computational cost.
There are technologies to compress deep networks by pruning weak connections. I don't believe the author is using this, so it's likely the computational cost could be reduced by a factor of 10. It could also be that simple tweaks to the NN architecture also work (was the author aiming for using a network as small as possible to begin with?).
Actually, what's in the demo already includes pruning (through sparse matrices) and indeed, it does keep just 1/10 of the weights as non-zero. In practice it's not quite a 10x speedup because the network has to be a bit bigger to get the same performance. It's still a pretty significant improvement. Of course, the weights are pruned by 16x1 blocks to avoid hurting vectorization (see the first LPCNet paper and the WaveRNN paper for details).
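The block pruning described here can be sketched roughly as follows. This is an illustration of the idea, not the actual LPCNet code; the 16x1 block shape and 10% density mirror the figures mentioned above.

```python
# Sketch of magnitude-based block pruning: keep only the strongest 16x1
# blocks of a weight matrix, so vectorized kernels still see dense
# 16-wide chunks instead of scattered nonzeros.
import numpy as np

def prune_blocks(W: np.ndarray, block: int = 16, density: float = 0.1):
    rows, cols = W.shape
    assert rows % block == 0, "rows must be a multiple of the block height"
    B = W.reshape(rows // block, block, cols)
    energy = np.sum(B * B, axis=1)                # one score per 16x1 block
    k = max(1, int(round(density * energy.size))) # number of blocks to keep
    thresh = np.partition(energy.ravel(), -k)[-k]
    mask = (energy >= thresh)[:, None, :]         # broadcast over block dim
    return (B * mask).reshape(rows, cols)
```

Pruning whole blocks rather than individual weights is the key to keeping the speedup real: a 16x1 surviving block maps directly onto a SIMD vector operation.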