322 points by jlpcsl
2 months ago
Created a few years ago by Jean-Marc Valin of Xiph/Mozilla (who, by the way, is also the author of the Opus codec, among other things): https://gitlab.xiph.org/xiph/rnnoise/
Overview of RNNoise from the horse's mouth is here: https://jmvalin.ca/demo/rnnoise/
Used as a Wasm module! In some ways the web is becoming more opaque. Is this the future then, a hodgepodge of binaries doing things behind the scenes? Though in this case it happens to be OSS, and it may well be a moot point -- the backend is already a black box to the end user, now parts of the frontend are black boxes too. The practical implication is probably just that some measure of customizability is gone.
Is this a troll comment? Yes, wasm works based on a compiled binary, just like any other program written in a compiled language in the past 50 years. You seem to suggest that everyday users of the web are just going into the JS sources of webpages and understanding what's going on. With the plethora of libraries, frameworks and static optimization used in today's websites, normal people can't really dissect the inner workings of a website just by looking at the code. That's why we have tools like request analyzers etc., all of which would still work with compiled libraries.
Compiled code has existed for half a century and we know how to work with it.
They didn't suggest the web is doomed, just that more aspects of it are opaque. I don't think they're talking about everyday users of the web either, but rather nascent developers.
The early web was a great equalizer. Anybody could study a little html, download an ftp manager, jump through a few procedural hoops and have a web page. After some studying and trial and error they could even build an interactive site.
It's easy to miss all the potential of wasm when that's what you remember of the web. To me the amazing thing is that browsers will still work with the methods described above but we're on the cusp of being able to do almost everything a full application environment can do.
That said, even though there will be plenty of OSS wasm tech, it'll still be more opaque to those of us who don't do compiled languages. It'll be a lot tougher to just fork the code and do something more creative with it.
PHP used to stand for "Personal Home Page" and, as one of its founders put it, was created so that "any idiot" could make an interactive site.
Are modern-day "no code" tools like Webflow not an acceptable equivalent?
We already lost any semblance of building from scratch in the mid-2000s with the emergence of gargantuan HTML templates and WordPress/Drupal/phpBB deployments with plugins and themes.
This is a direct result of people being held to higher standards and thus spending a lot more effort overriding the compositional and behaviour defaults of the user agent.
The modern-day iteration just optimizes for scaling up to tens of thousands of concurrent end-users on anemic hardware.
We have to accept the fact that personal webpages gave way to social network profile pages. This didn't happen overnight and there is zero demand for a hand-crafted presence on the web anymore.
No, an environment for writing new code is not any kind of equivalent for the ability to reverse-engineer existing code. Firebug and its clones are a much closer equivalent than anything like Webflow.
Building from scratch is out of favor, but not necessarily that far off. Folks like GitHub & YouTube have very simple bottom-up web component systems they use, rather than top-down frameworks. Existing concerns about bundling might be met by bundled HTTP exchanges (webpackage).
I don't think "no code" is an aid. If anything it's pushing in the opposite direction: rather than a transparent, approachable web medium, it suggests we need hyperadvanced tools that we really won't understand or have control over to synthesize web code. It's a simpler user experience, but a push away from notepad.exe webdev.
I wouldn't rush to any conclusions about who or what has won as a settled fact for all time. We haven't had good ways to run online systems ourselves, versus having them hosted for us, and there are still lightyears to go, but we're doing good things & finally maturing well. We're only a couple of years into ActivityPub as an interchange format & into growing many of the capabilities & tools & systems, around all kinds of use cases, that will make throwing together a fair, interactable, competitive offering possible. Social media has had huge, huge investment poured into it, but we are in the decent preteen years of growing up & owning the libre equivalents. We can assess demand only after there is a visualizable state people can imagine; just having an isolated blog is not the equivalent of a well-connected social media site, but these capabilities are slowly arising. Follow the alpha geeks; this currently long phase will not last forever.
Sure “everyday users” aren’t clicking “View Source”, but that’s not really what the issue is about.
When I was a kid, every piece of software I used was pre-compiled, and therefore opaque. This made it difficult for me to figure out how people made certain things, and after a while I lost interest in programming.
When I got back into it later, one thing that made a huge difference was being able to see how various cool JS sites were built. The ability to “View Source” like that was revolutionary, and also allowed me to build some early fun projects, like a Cookie Clicker “AI” that could play the game automatically by calling the functions I could see in the game’s source.
I’m far from the only person with experiences like these. Yes, there was programming before View Source and there will be programming after. And for those of us with the right tools or reverse engineering skills, View Source isn’t particularly relevant. What we’re losing is a pipeline that helped people become/stay interested in programming, which makes it likely that future programmers who would’ve followed a path like mine will do something else instead.
On the other hand, it's never been as easy to contribute to OSS projects as it is now. GitHub has severely lowered the requirements compared to earlier setups where you had to get an e-mail client, configure it in just the right way, etc. You have live-coding YouTubers, there are Discord communities for all types of technology, and knowledge about programming and technology is extremely available through Google, way more than it was 20 years ago. I think young people still have tons of opportunities to start out.
You're really not going to "view source" and understand anything when all the JS I ship is minified and uglified. Nowadays, JS is simply another compile target.
...I still dissect website code, thank you very much. Basically have to do it just to figure out quirks I'm constantly running into.
(You could technically turn the wasm into JS and unminify that too, which I doubt is much harder or easier to decipher than the same thing written in JS and minified/unminified.)
If we want to be able to trust the web, we have to get rid of the automatic download and execution of arbitrary script code.
I think we’ve been lulled into some false sense of expectation that the web exists as a place for “open source code” to be run. As if the fact that you can view the source of any page were any guarantee of that.
If that’s your definition of transparency, then perhaps learning to read assembly would give you the same comfort. In fact, there are a lot more binaries distributed with symbols intact than there is unminified JS.
Or, to put it another way, if you could right click -> view disassembly of any binary on your computer, how would that be any different than today’s web?
What a weird take.
How else would we have implemented this? WASM has facilitated introducing these technologies into web applications, it literally wasn’t possible before.
Thanks to emscripten it wasn’t even that hard to get rnnoise working on WASM: https://github.com/jitsi/rnnoise-wasm
I concede WASM does open the possibility of adding opaque stuff to web apps but IMHO the benefits outweigh the drawbacks at this point.
Oh no you're absolutely right, my general frustration was ill-placed for this thread. Wasm is no doubt the right and only way to have done this.
Short of reproducible builds, you can't even check that what you're being served is, in fact, the OSS version.
Aren't wasm modules still sandboxed? Reverse engineering binaries should be about as complex as reverse engineering uglified JS, so I'm not sure how they are more opaque.
You probably have a point but I'm thinking unuglified js code (http://www.nice2predict.org/) is not as impenetrable as code from reverse engineered wasm binaries? The element of plausible deniability is more potent though for the nefarious actor on the other side in the case of wasm binaries.
Wasm is about distribution of binaries, not about open source. Those are two different subjects.
When I install a program on my Debian machine with apt-get, I also get binaries. But this doesn't mean that it is opaque, right?
Just wait until all your ads are wasm blobs! Wheeeeeeee!
If you have human-readable JS, you either didn't write much or you forgot to minify it.
Sorry, but at least on my smartphone, I can understand the unprocessed audio showcased further down the web page better than the noise-suppressed audio. How is that?
The original audio is significantly easier to understand. This may be technically interesting, but the noise suppression is aggressive to the point that it's eating critical signal with the noise.
> So what should you listen for anyway? As strange as it may sound, you should not be expecting an increase in intelligibility. Humans are so good at understanding speech in noise that an enhancement algorithm — especially one that isn't allowed to look ahead of the speech it's denoising — can only destroy information. So why are we doing this in the first place? For quality. The enhanced speech is much less annoying to listen to and likely causes less listener fatigue.
> Actually, there are still a few cases where it can actually help intelligibility. The first is videoconferencing, when multiple speakers are being mixed together. For that application, noise suppression prevents the noise from all the inactive speakers from being mixed in with the active speaker, improving both quality and intelligibility. A second case is when the speech goes through a low bitrate codec. Those tend to degrade noisy speech more than clean speech, so removing the noise allows the codec to do a better job.
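A back-of-the-envelope sketch of the videoconferencing case in the quote above. It assumes every participant contributes uncorrelated background noise at the same level, so the noise powers (not amplitudes) add up in the mix. The function name and the example levels are made up for illustration and do not come from the RNNoise page.

```python
import math

def mixed_noise_floor_db(per_speaker_noise_db, num_speakers):
    """Noise level after mixing num_speakers uncorrelated noise sources
    that each sit at per_speaker_noise_db (assumption: equal levels)."""
    if num_speakers < 1:
        raise ValueError("need at least one speaker")
    # Uncorrelated sources add in power, i.e. +10*log10(N) dB.
    return per_speaker_noise_db + 10.0 * math.log10(num_speakers)

# Ten open microphones at -50 dB each raise the mixed noise floor by 10 dB.
print(mixed_noise_floor_db(-50.0, 10))
```

Under these assumptions the noise floor grows with every unmuted participant, while only one person is actually talking, which is why suppressing noise per speaker before mixing can help.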
I do think that for direct listening, the jitsi.org speech samples would be slightly more intelligible if the noise removal was tuned to pass through frequencies with mixed noise and signal. I don't know if that would be worse in a video conference. Does the speaker or listener get to choose between conservative and aggressive noise removal?
This is the default for online conferencing. Everyone is way better off asking the other party to repeat a couple of words than listening to all that noise during the whole call.
> Everyone is way better off asking the other party to repeat a couple of words than listening to all that noise during the whole call.
I didn't understand the first three words, for Alice it was the next two, and for Bob it was the last four. How many people are going to ask to repeat?
Evolution taught us to understand over the sound of waves, crickets, rain, thunder, and more. It didn’t teach us to comprehend with half the signals masked.
But this might be better served with a simplistic voice activity detection, like in mumble.
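For readers unfamiliar with the idea, here is a minimal sketch of a simplistic, energy-based voice activity gate. It is not Mumble's actual implementation; the threshold, the hangover length, and the function names are assumptions picked for illustration.

```python
import math

def frame_rms(frame):
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def vad_gate(frames, threshold=0.02, hangover=2):
    """Return one bool per frame: True means 'transmit this frame'.

    threshold and hangover are illustrative assumptions, not Mumble defaults.
    The hangover keeps the gate open for a few frames after the level drops,
    so that word endings are not clipped.
    """
    decisions = []
    open_frames_left = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            open_frames_left = hangover + 1  # this frame plus the hangover
        if open_frames_left > 0:
            decisions.append(True)
            open_frames_left -= 1
        else:
            decisions.append(False)
    return decisions

quiet = [0.001] * 480
loud = [0.5, -0.5] * 240
print(vad_gate([quiet, loud, quiet, quiet, quiet]))
```

Unlike noise suppression, a gate like this passes the signal through untouched while someone is speaking and simply sends nothing otherwise, so it cannot mask parts of the speech, but it also does nothing about noise during speech.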
I’m using RNNoise as a PipeWire input filter on my Linux machines, but that’s very Linux-specific and a bit “hardcore” to set up.
Nice to see it getting integrated into video meeting solutions, so more people can take advantage of this awesome library.
>but that’s very Linux-specific and a bit “hardcore” to set up
Have you tried https://github.com/noisetorch/NoiseTorch/?
Or https://github.com/wwmm/easyeffects for noise reduction and other effects like compression and EQ for a real crooner voice in any video call application.
Definitely recommend easyeffects over noisetorch. No root, a high-quality GUI, and it can launch automatically at startup. I only use the noise suppression 99% of the time but having the other effects available can also be fun.
Could you please elaborate how to do it or post some good/not outdated links?
Don’t remember exactly which guide I followed, but I used the build from this repo, and the instructions look plausible:
Bummer, reading the title I thought Jitsi had a new de-noiser because they have had RNNoise for some time. Unfortunately RNNoise has not received much advancement for a couple of years. It's by now half-a-decade-old tech. I've worked with the WASM version in the past but it can be hit or miss. Sometimes it makes the audio you want a bit weird. It also added something like 10% CPU usage and in the end we disabled it again.
I'd love to see a more state-of-the-art solution that works with WASM. Something that one could train on one's own voice, so that it filters out everything else, would be awesome. Because all the noise cancellation tech does not help if you sit in an environment with other people talking next to you and the AI doesn't filter them out because they're voices. Sometimes coworkers use Krisp but even that proprietary paid solution is so-so.
While we’ve had rnnoise integration for a while it was for “noisy environment” notifications, this is the first time we use it to actually filter audio.
Also audio worklets weren’t a thing when we first introduced it.
I’m not aware of any other open source (and better) models, but if any come up, we’ll certainly check them out!
Thanks for the clarification. We also experimented with audio worklets + RNNoise about 1.5 years or so ago but had very mixed results. The potential upside of processing in another thread is clear, but some browser and OS combinations just didn't work well and resulted in micro stutters in the audio. I remember Chromium on Linux for example being finicky. Some browsers worked better with smaller buffers, some needed bigger ones. We spent too much time debugging and tuning for different systems and the audio quality improvement was not deemed good enough, so we shelved the effort. I guess audio worklets have improved since then and are probably more usable by now. Do you guys have some kind of performance monitoring for the noise cancellation or audio in general?
At the time I also spent a few days looking for something better but didn't really find anything. Unfortunately RNNoise is the best we have :( The only other noise cancellation software that actually impressed me was the one from Nvidia, but that's not something that one could integrate via WASM and of course it wouldn't work on most devices anyway.
Oh what a day it will be when we have energy-efficient hardware encoders for AV1 in every device plus some really good noise cancellation. Oh and then we just need internet connections without packet loss :P
If you have any involvement with Jigasi or might be in the know -- are there plans to use Whisper, for instance, instead of Google's API for transcription? If I recall correctly Jigasi is using Google's API, and local transcription would align well with the rest of Jitsi's mission.
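One part of the buffering problem can be sketched in isolation. As far as I know, an AudioWorklet hands audio to the processor in 128-sample render quanta, while RNNoise consumes 480-sample frames (10 ms at 48 kHz), so some re-blocking buffer has to sit in between. The class below is a hypothetical illustration of that step in Python, not code from Jitsi or rnnoise-wasm, and a real worklet would use a preallocated ring buffer rather than a growing list.

```python
class FrameAdapter:
    """Re-blocks a stream of small chunks into fixed-size frames."""

    def __init__(self, frame_size=480):
        self.frame_size = frame_size
        self._pending = []  # samples received but not yet emitted

    def push(self, chunk):
        """Add a chunk of samples; return the complete frames now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[:self.frame_size])
            del self._pending[:self.frame_size]
        return frames

    @property
    def buffered(self):
        """Number of samples waiting for the next full frame."""
        return len(self._pending)

adapter = FrameAdapter(480)
for n in range(4):
    ready = adapter.push([0.0] * 128)  # one 128-sample quantum per call
print(len(ready), adapter.buffered)    # one full frame after 512 samples
```

Because 480 is not a multiple of 128, the denoiser runs on some callbacks and not on others, and output has to be delayed by at least one frame. That uneven load per callback is one plausible source of the kind of stutter described above.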
The problem for Jigasi's speech-to-text feature with Whisper - or any recent SOTA speech-to-text neural networks, is that they are transformer-based. One of the key features of transformers is that they are very good at processing a sequence with the attention mechanism. But attention inherently needs to see the whole input sequence. So it's difficult to adapt these architectures to perform well in real-time scenarios like captioning meetings.
Yes! But a part of the Jitsi ecosystem enables recordings and Whisper is a good candidate to use for these recorded sessions.
On that topic — they record sessions in an interesting way, basically an instance of Chrome is started and captured... I think with OBS. That always made me raise an eyebrow but I also can’t think up a better way.
edit: It's actually jibri which has to do with recording. Gosh I wish the names were a liiiittle more intuitive. :)
We do have VOSK support already. I haven’t heard of Whisper, but it does sound like a good GSoC project for next year!
If I have time I'll try to help you guys out. I'm a big fan of what you're doing. :)
It does seem to do a good job of eliminating noise but it seems like it gets rid of a lot of the signal too. It's much easier to understand the noisy sample than the processed one.
Somewhat tangential, but at my work we have found WebEx’s background noise removal to be absolutely amazing. So many times we’ve had someone in a meeting say “sorry about X/Y/Z, it’s so noisy”, and the rest of us won’t hear a thing. This sorta tech has gotten so good, and is a really nice quality of life improvement for remote work. (Or for meetings with people in noisy offices of course)
Very very good, a little bit of stuttering during the honking I think but I like it overall! :D :)
Jitsi Meet has been a great alternative to other meeting apps in these crazy times.
Indeed! I recently used a locally hosted Jitsi to talk to my family in the other room while in COVID isolation. It was a life saver, and extremely easy to setup with docker-compose with only a handful of steps that I could complete even with fever+headache https://jitsi.github.io/handbook/docs/devops-guide/devops-gu...
My experience with Jitsi Meet has been quite bad. My previous employer was a cheapskate, and they self-hosted Jitsi Meet. Random disconnections and instability were pretty much a daily occurrence, some people were disconnected every few seconds. While I suppose the self-hosting by Cheapskate Inc. was the main culprit, Jitsi’s screen sharing wasn’t looking very good.
My last employers were cheapskates too (I love 'em for it) and they just used meet.jit.si for calls. It was a lot more stable than self-hosted jitsi. That said there were almost always microphone or video issues using it, just because people weren't used to it I guess. It made job interviews fun. It was a nice live test to show how a potential employee handled adversity.
We’ve made significant tweaks to screen-sharing in the past 2-3 stable releases, in case you feel inclined to check us out again :-)
Aside from consuming a ton of resources when screen sharing, my experience with Jitsi Meet has been very good. It consumes two cores of my 5900X (one for the Firefox process, and another for some system process I don’t recall exactly) but it works. This was with sharing a 4K screen.
I have run Jitsi on cheap VMs and it worked decently. But you need quite a few cores to serve all the traffic. Ultimately I ended up having as many 2-4 core VMs as I had concurrent calls.
How is meet.jit.si hosted? I assume that, with lots and lots of random users, the bandwidth and processing costs are astronomical.
For a large proportion of users I expect the audio/video to be peer to peer thanks to WebRTC. Other traffic (for that set of users) should be minimal.
My company self-hosts an instance and it's excellent.
So someone hosted $software on a shitty server, and you blame $software for the shitty performance? To draw any conclusions you should look at meet.jit.si (hosted by Jitsi), no?
I've been using Jitsi Meet regularly for about a year. It's usually fine, but on some days I experience disconnections every several minutes.
Is there video conferencing software that does spatial audio for conferences? What I have in mind is that it is often hard to understand each other while multiple people are talking. It is much easier in person. I guess it all comes down to the ability to focus on the directional cues of an audio source. Currently everyone is placed inside one's head, so they interfere with each other much more.
Apparently FaceTime offers this. Presumably Apple will allow other companies to do it as well, since they let them offer spatial audio in other contexts.
Why would Apple need to allow it? It's simply a matter of a given program postprocessing the various audio streams appropriately prior to muxing them for output.
Because Apple’s spatial audio varies as you rotate your head. The device knows where your head is located, but the app doesn’t (unless the OS tells it).
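To illustrate the "postprocess each stream before mixing" point for the static case, which needs no head tracking, here is a minimal sketch that places each participant at a fixed position in the stereo field using constant-power panning. This is plain stereo panning, a much simpler technique than HRTF-based spatial audio such as Apple's, and the function names and pan positions are assumptions for illustration.

```python
import math

def pan_gains(pan):
    """Constant-power stereo gains for pan in [-1.0 (left), +1.0 (right)]."""
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0..pi/2
    return math.cos(angle), math.sin(angle)

def mix_spatial(streams, pans):
    """Mix equal-length mono streams into (left, right) sample lists."""
    if len(streams) != len(pans):
        raise ValueError("need one pan position per stream")
    length = len(streams[0]) if streams else 0
    left = [0.0] * length
    right = [0.0] * length
    for samples, pan in zip(streams, pans):
        gain_l, gain_r = pan_gains(pan)
        for i, s in enumerate(samples):
            left[i] += gain_l * s
            right[i] += gain_r * s
    return left, right

# Hypothetical call with three participants spread from left to right.
alice, bob, carol = [0.2] * 4, [0.1] * 4, [0.3] * 4
left, right = mix_spatial([alice, bob, carol], [-0.8, 0.0, 0.8])
```

Giving every voice its own direction is what lets a listener separate overlapping talkers. Tracking head rotation on top of that is the part that needs sensor data from the OS.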
You could give up on the audio portion of your current video conferencing setup and just install TeamSpeak with a spatial plugin https://www.myteamspeak.com/addons/9ddfa0b2-25c2-4302-8a43-0...
Assuming the demo samples aren't rigged, that's a very substantial improvement.
I prefer the sample with the noise. It seems clearer and easier to understand.
Would you prefer to listen to this noise for half an hour? :)
Or just have them mute and unmute at appropriate times. I do this even in non-noisy environments
Yes, you'll tune out the noise yourself in your brain.
Anyone have tips for using Jitsi? I've been thinking about moving off Zoom now that they're enforcing a 40 min limit even for one-on-one calls.
Does it create friction for folks who haven't used it before? Any suggested instructions to send with a meeting invite?
We've been using Jitsi via Zulip chat at work. It should be drop-in for at least small groups (one-on-one, a handful of people - I have yet to investigate "conference" or "classroom" size).
We do unfortunately see semi-regular lock-ups/freezes where one end of the stream stops for ~30 seconds. Maybe this is worse in Safari vs Chrome/Firefox - we have not yet experimented much with different browsers. Or maybe there's a difference between x86_64 and arm/m1/m2.
AudioWorklet and WebAudio in general are buggy in Safari.
Thanks! I'll try and see if we fare better with Firefox.
I was invited to a Jitsi meeting a while ago, and not having any video background removal, a lot less audio processing and what looked like no video processing at all meant everyone was harder to understand, harder to see, and any activity or clutter in the background was of course fully visible. I guess buying quality microphones and cameras for everyone involved would help. Detailed instructions are a good idea as well; I struggled a bit with the unfamiliar interface.
Personally, I'd stick with the big names, long remote meetings are strenuous enough even with all the quality of life features those offer.
I love Jitsi. Super easy to get people to join and learn the UI. Reliable and fast. Audio beats regular phone calls.
I note that RNNoise was trained on proprietary data and thus would be considered a "toxic candy" ML model under the Debian Deep Learning Team's Machine Learning Policy:
Personally, I use Nvidia Broadcast which works great with noise suppression, and, best of all, works with any app because it shows up as a standard audio device. It also has camera features such as sharpening and visual noise reduction but I don't use those much. Caveat that it requires an Nvidia GPU but if you have one, it is better than any other noise suppression system I've seen.
It requires not just any Nvidia GPU but "NVIDIA GeForce RTX 2060, Quadro RTX 3000, TITAN RTX or higher"...
This looks interesting.
Most devices have noise suppression built in. Do you use RNNoise in tandem with the hardware noise suppression, or do you disable it?
Also, are there any plans to implement this with react-native on mobile? And if so, how would you implement this, since audio worklets and WebAudio aren't available?
As an aside, Jitsi is pretty awesome for quickly building a video app idea. The API is very easy to use.
SignalWire's SDK is another option and it uses an MCU instead of SFU: https://signalwire.com/products/cloud-video
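For context on the MCU versus SFU distinction, here is a simplified sketch of how many media streams each participant sends and receives under the common topologies. It ignores simulcast and the split between audio and video tracks, and the function name is hypothetical; it is a textbook-level model, not a description of how Jitsi or SignalWire are actually deployed.

```python
def streams_per_participant(topology, n):
    """Return (upstream, downstream) stream counts for one participant
    in an n-person call, under a simplified model."""
    if n < 2:
        raise ValueError("a call needs at least two participants")
    if topology == "mesh":  # every peer sends directly to every other peer
        return n - 1, n - 1
    if topology == "sfu":   # server forwards each sender's stream as-is
        return 1, n - 1
    if topology == "mcu":   # server decodes, mixes, sends one composite
        return 1, 1
    raise ValueError(f"unknown topology: {topology!r}")

for topology in ("mesh", "sfu", "mcu"):
    print(topology, streams_per_participant(topology, 5))
```

The trade-off in this model is that an MCU keeps client bandwidth flat at the cost of server-side decoding and mixing, while an SFU keeps the server cheap and pushes the fan-in onto each client's downlink.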
Wow their prices remind me of 90s phone calls
This is one of the filters OBS uses too (the other is Speex, which is obsolete to some extent).
I haven't heard of Jitsi before this. The tech looks promising! But I can't find the link to download the software for PC/Mac, only Android and iOS links. Is it mobile only?
Jitsi Meet is used directly in the browser. Like at this address: https://meet.jit.si. You can also set up your own instance.
A desktop client also exists for Windows, macOS, Linux: https://github.com/jitsi/jitsi-meet-electron - kind of not really advertised, provides remote desktop control contrary to the strictly web browser version.
It's a free video conference service.
It's rare to find creative, real-time, lightweight uses of ML, but I love when it's done, and this has an impressive and well-written explanation with it as well. Great stuff.