points by JimDabell 6 years ago

It would be interesting to see how far you could get using deepfakes as a method for video call compression.

Train a model locally ahead of time and upload it to a server; then, whenever you have a call scheduled, the other participants download the model in advance.

Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end. When the tech is a little further along, it should be possible to get good quality video using only a fraction of the bandwidth.
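A rough back-of-the-envelope sketch of the bandwidth gap this idea exploits. All numbers are hypothetical: a 720p frame as the baseline (real codecs compress far better than raw, but the gap is still large) and the common 68-point facial-landmark convention as the "representation of facial movements":

```python
import struct

# Hypothetical comparison: bytes per frame for raw video vs. a compact
# set of facial landmarks driving a model the recipient already has.

def raw_frame_bytes(width=1280, height=720, bytes_per_pixel=3):
    """Uncompressed 720p RGB frame size, for scale."""
    return width * height * bytes_per_pixel

def landmark_frame_bytes(num_landmarks=68, floats_per_landmark=2):
    """Pack (x, y) landmark coordinates as 32-bit floats."""
    n = num_landmarks * floats_per_landmark
    payload = struct.pack(f"{n}f", *([0.0] * n))
    return len(payload)

raw = raw_frame_bytes()             # 2,764,800 bytes uncompressed
landmarks = landmark_frame_bytes()  # 544 bytes
ratio = raw / landmarks             # thousands of times smaller per frame
```

Even granting that H.264 already shrinks the raw frame by two orders of magnitude, the landmark stream stays far below any conventional video bitrate.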

teraflop 6 years ago

This is a minor plot point in Vernor Vinge's excellent SF novel A Fire Upon the Deep.

One of the premises of the novel's universe is that computational power is generally absurdly plentiful, but communications bandwidth over interstellar distances is not. Most communications are in plain text (modeled after USENET) but in some cases, "evocations" are used to extrapolate video and audio from an ultra-compressed data stream.

The trouble, of course, is that it's not very obvious what aspects of the image you're seeing are real, and what aspects were dreamed up by the system doing the extrapolating.

  • Aperocky 6 years ago

    > it's not very obvious what aspects of the image you're seeing are real, and what aspects were dreamed up by the system doing the extrapolating.

It would be quite obvious unless the raw data from before the extrapolation is destroyed, for which there is no reason; nor is it possible to stop others in the vicinity from receiving that raw data.

    • teraflop 6 years ago

      That assumes that the "raw data" is reasonably human-comprehensible (which neural network weights and activations are notoriously not) and/or that you have time to sit down and analyze the data at your leisure.

      But saying more would be spoilery...

    • read_if_gay_ 6 years ago

      For that to be true the compression algorithm mustn’t be very efficient.

  • cheez 6 years ago

A main premise of the Fear the Sky trilogy as well, but solved in a different way. Machines representing various political factions from the home planet are uploaded with AI that mimics them emotionally and politically for all intents and purposes. I really enjoyed this book.

    • lgl 6 years ago

      +1 recommendation for this trilogy

    • luhem7 6 years ago

Eh, I personally enjoyed the series, but I wouldn't recommend anything beyond book 1. Book 2 is ok. Book 3 really spoiled the series for me because of the inconsistent behavior of the main character. (Keeping it vague to avoid spoilers)

      • cheez 6 years ago

        Same. Notice I said "the book" while mentioning the trilogy ;-)

MasterScrat 6 years ago

Google recently introduced something like that for Audio in Duo: https://ai.googleblog.com/2020/04/improving-audio-quality-in...

> WaveNetEQ is a generative model, based on DeepMind’s WaveRNN technology, that is trained using a large corpus of speech data to realistically continue short speech segments enabling it to fully synthesize the raw waveform of missing speech.

I don't think you need to train for each person specifically; you could train one model for all heads, then transmit a few high-quality pics when the call starts and interpolate from those afterward.

lambdaba 6 years ago

Excellent idea, and we'll surely be seeing something like this; there are already AR apps that map facial expressions to avatars.

Downside could be some uncanny valley if the models are not very high quality.

But if I had to make a prediction, I'd expect we'll get much more value from higher bandwidth, ultra high definition streaming and features like 3d cameras / virtual reality. I think we have a tendency to really underestimate how important high definition is for human communication.

  • 4lun 6 years ago

    > I'd expect we'll get much more value from higher bandwidth, ultra high definition streaming and features like 3d cameras / virtual reality. I think we have a tendency to really underestimate how important high definition is for human communication.

    Low latency is probably more important to me.

    Recently I seem to have a 3 second delay on many VC calls at work (and just for me it seems), and I either end up interrupting people or feeling reluctant to talk at all since it becomes impossible to time gaps and conversations right.

Despite getting a crystal-clear HD picture of all participants, I'd happily sacrifice video quality (in fact I'd accept audio-only in some cases) to get a more real-time experience (disabling video doesn't seem to have any effect).

    • JadeNB 6 years ago

> Despite getting a crystal-clear HD picture of all participants, I'd happily sacrifice video quality (in fact I'd accept audio-only in some cases) to get a more real-time experience (disabling video doesn't seem to have any effect).

      If you're really willing to sacrifice video completely, at least for Zoom, and probably for lots of other videoconferencing solutions, you can call into meetings with your phone. In fact, I think Zoom allows you to join with the computer for video and the phone for audio, which might be the best of both worlds.

      • Scoundreller 6 years ago

        Yes, zoom supports that.

        Slight issue in Toronto has been cellular system overloading and calls not being completed. But once on, no problem.

        I can’t blame the providers though. How could they have predicted that people would use the service they’re paying for?

    • smichel17 6 years ago

      This can be helped with hand-raising (queue style) and a dedicated facilitator for each meeting.

    • mlonkibjuyhv 6 years ago

      This is a long shot - but are you running on battery while this is happening? Had some weird issues that worked themselves out by plugging in the charger. Probably had to do with power savings and cpu throttling.

  • giovannibonetti 6 years ago

    > Downside could be some uncanny valley if the models are not very high quality.

    That can be controlled, since these compression algorithms usually work by making a prediction and sending the difference between the prediction and the actual value.

That works for lossless compression (where the difference is sent in full) as well as lossy (where only the most important part of the difference is sent).
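The predict-and-send-the-difference scheme can be sketched with a deliberately trivial predictor (repeat the previous sample). Function names and the quantization step are illustrative, not any real codec's API; a model-based codec would swap in a learned predictor, but the residual mechanics stay the same:

```python
# Minimal sketch of predictive (residual) coding with a trivial
# "previous value" predictor. quantize=1 is lossless; larger steps
# discard the least important part of the difference.

def encode(samples, quantize=1):
    prediction = 0
    residuals = []
    for s in samples:
        r = round((s - prediction) / quantize)
        residuals.append(r)
        prediction += r * quantize  # mirror the decoder's reconstruction
    return residuals

def decode(residuals, quantize=1):
    prediction = 0
    out = []
    for r in residuals:
        prediction += r * quantize
        out.append(prediction)
    return out

samples = [10, 12, 11, 15, 40]
lossless = decode(encode(samples))       # exact round trip
lossy = decode(encode(samples, 4), 4)    # error bounded by quantize / 2
```

Note that the encoder tracks the decoder's reconstructed value, not the true value, so quantization error never accumulates across samples.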

abakker 6 years ago

Even better would be for RPGs and things like Roll20. I’d love to deep fake different voices/ character faces on cue.

yowlingcat 6 years ago

Very cool idea. The coding used in H264 is a variant of the DCT, so moving one layer of abstraction up from there basically moves from semi-analog to fully digital. I agree that it should only require a fraction of the bandwidth because you'd only be sending parametric data rather than full video.
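For reference, a minimal (unoptimized, unnormalized) DCT-II, the transform family that H.264's integer transform approximates. The constant test signal is just an illustration of energy compaction, the property that makes truncating coefficients cheap:

```python
import math

# Naive 1-D DCT-II. Smooth signals concentrate their energy in the
# first few coefficients; a codec keeps those and drops the rest.

def dct(x):
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
        for k in range(N)
    ]

smooth = [1.0, 1.0, 1.0, 1.0]
coeffs = dct(smooth)
# For a constant signal, all energy lands in coefficient 0;
# the remaining coefficients are (numerically) zero.
```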

dheera 6 years ago

I think this is largely possible, and perceptual accuracy to a human is very different from the MSE-based accuracy used in traditional lossy compression algorithms.

To a human, for example, the exact pattern of every strand of hair isn't important at all -- all that matters is that the hairstyle and hair color stays the same.

The algorithm also doesn't need to worry about encoding and reconstructing skin blemishes, since many people would actually enjoy not having to put on makeup for a video call.
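A toy illustration of the MSE-versus-perception gap (all values illustrative): a high-frequency "hair strand" stripe pattern shifted by one pixel looks nearly identical to a human, yet scores an enormous MSE, while flattening the texture entirely, which visibly destroys the detail, can score lower:

```python
# Why MSE tracks perception poorly: alignment errors are punished
# harshly, while loss of fine texture can be rewarded.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

stripes = [0, 255] * 8                 # row of pixels across hair strands
shifted = stripes[1:] + stripes[:1]    # perceptually near-identical shift
flat = [128] * 16                      # texture gone; looks blurred

mse_shift = mse(stripes, shifted)  # huge: every pixel differs by 255
mse_flat = mse(stripes, flat)      # smaller, despite losing the detail
```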

stagas 6 years ago

I was thinking the same thing today. I wonder if it can be done on the spot: capture your image from the camera initially, then send the rest as data points for deepfake generation on the other side, based on your own image. That would be amazing for low/limited-bandwidth situations.

userbinator 6 years ago

> Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end.

MPEG-4 Part 2 actually had something like that, called "face and body animation" (FBA). As far as I know, there are no implementations in widespread use.

mwcampbell 6 years ago

I wonder if the same kind of thing is feasible for someone's voice.

  • aidos 6 years ago

    I believe that’s what the vocoder was created for.

    https://en.m.wikipedia.org/wiki/Vocoder

    • pbhjpbhj 6 years ago

Mentioned in Jim Akhaleli's (sp?) "Revolutions" episode about the smartphone, currently on Netflix (my lad was watching it; really good for juniors or non-technical people IMO).

      • klondike_klive 6 years ago

        Almost right. It's Jim Al-Khalili. Great show.

  • mwcampbell 6 years ago

    Update: I did some searching and found some interesting demos of a hybrid of neural nets and more conventional DSP called LPCNet:

    https://people.xiph.org/~jm/demo/lpcnet_codec/

    Sure enough, it was discussed on HN when it came out last year. I think I missed it then.

    For those who didn't catch this from the URL, this is by Jean-Marc Valin, of Speex and Opus fame.

peter_d_sherman 6 years ago

I second the already expressed sentiments!

An utterly brilliant idea!

lowdose 6 years ago

This has already been done by criminals posing as CEOs.