I get what Deepfakes are used for, but Wikipedia went right over my head explaining how they are made – so how is this done, exactly?


In: Technology

5 Answers

Anonymous 0 Comments

Basically it uses a few data points to infer the rest.

It’s like having to guess the next numbers in a series: 1, 3, 5, 7, 9, 11, 13, 15… you use the data you have to guess the rest.

Deepfake software takes actual footage of those people, notes their facial expressions, lip movements, voice, etc., and then basically fills in the gaps with whatever you want them to say or do.

For speech, it’s all sound waves and transitions, so with a sample of their voice you can basically fill in the gaps. The same goes for facial expressions and lip movements. It’s not always perfect, but with feedback it can fine-tune itself.
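As a toy illustration of “use the data you have to infer the rest”, here is a minimal Python sketch. The straight-line fit is just a stand-in for whatever pattern a deepfake model learns; nothing here is the actual deepfake algorithm.

```python
import numpy as np

# The known part of the series: 1, 3, 5, 7, ...
known = np.array([1, 3, 5, 7, 9, 11, 13, 15])
positions = np.arange(len(known))

# "Learn" the pattern from the data points (here it's just a straight line).
slope, intercept = np.polyfit(positions, known, 1)

# Use the learned pattern to fill in the gaps (predict the next few values).
next_positions = np.arange(len(known), len(known) + 4)
print(slope * next_positions + intercept)   # ≈ [17. 19. 21. 23.]
```

A deepfake model does the same thing in spirit, just with millions of numbers describing a face instead of one simple sequence.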

Anonymous 0 Comments

There’s no simple way to explain this, because the program was trained to be able to do it. It’s like asking how they made a great athlete – do you want to see his training schedule?

For deepfakes, the most I can say is that it’s a three-part system. There’s one AI that can tell the underlying expression from videos, a second AI that can morph a photo of somebody’s face to any specified expression, and a third AI that can composite a face onto an existing photo with natural blending.
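Purely as an illustration of that three-part split, here is a hypothetical Python skeleton. None of these function names come from a real library; they just give the three jobs names so the shape of the pipeline is visible.

```python
def read_expression(frame):
    """AI #1: estimate the facial expression/pose visible in a video frame."""
    raise NotImplementedError   # stands in for a trained expression model

def render_face(photo_of_b, expression):
    """AI #2: morph a photo of Person B so it shows the given expression."""
    raise NotImplementedError   # stands in for a trained face generator

def blend_into_frame(frame, rendered_face):
    """AI #3: composite the rendered face into the frame with natural blending."""
    raise NotImplementedError   # stands in for a trained blending/compositing model

def deepfake_frame(frame, photo_of_b):
    expression = read_expression(frame)
    face_b = render_face(photo_of_b, expression)
    return blend_into_frame(frame, face_b)
```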

Anonymous 0 Comments

To understand that you’d have to understand the concept of a neural network.

[https://www.youtube.com/watch?v=aircAruvnKk](https://www.youtube.com/watch?v=aircAruvnKk)

Essentially, a neural network uses a bunch of multiplication and addition to approximate some function that it is trying to “learn.” In this case it is minimizing the error between what a video with a regular face looks like, and what the video it generates with a new face looks like.
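Here is a minimal PyTorch sketch of that “minimize the error” loop, using made-up 8×8 images as stand-ins for video frames. In a real deepfake the training adjusts the network’s weights rather than the pixels directly, but the idea of repeatedly shrinking the error is the same.

```python
import torch

target = torch.rand(8, 8)                           # what the real frame looks like
generated = torch.zeros(8, 8, requires_grad=True)   # what the model currently produces

opt = torch.optim.SGD([generated], lr=0.5)
for step in range(200):
    loss = ((generated - target) ** 2).mean()   # pixel-wise error between the two
    opt.zero_grad()
    loss.backward()
    opt.step()                                  # nudge "generated" to reduce the error
```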

Anonymous 0 Comments

It’s a shame deepfakes got such a bad rap right off the bat. Yeah, people made porn with it, but sex has always been a good innovator. If we’d let the pervy stuff burn itself out, we could have gotten incredible advances in VR motion, artificial intelligence simulations, and who knows what else. But some celebs got butthurt, so now we’ve lost a ton of progress on that technology.

Anonymous 0 Comments

You have to understand what a neural network is. A neural network is basically a bunch of math that takes some input and spits out some output. How it does that is controlled by a huge number of “dials” that relate the input to the output through a series of steps in between. It’s a lot of multiplication and addition, basically. The structure of the neural network (the connections) is preset in some regular pattern (a huge mesh, basically), but the strength of each connection is variable.

The idea is that no human can figure out how to “set up” all those knobs manually, so instead you give a training program a bunch of examples of what the inputs and outputs should look like, and it automatically tweaks all the knobs to make the network get closer and closer to the desired outputs. This is all inspired by how brains work, with many neurons connected together by connections of different strengths that change in order to “learn” – hence the name neural networks. Because they’re just a massive pile of numbers, nobody really “understands” how they work, but we can have computers train them to do useful things.
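To make the “knobs tweaked by a training program” idea concrete, here is a tiny network written in plain NumPy. The task (fitting a sine curve) is just a stand-in example; the point is that the network is nothing but multiplication, addition, and a pile of adjustable numbers that a loop keeps nudging toward better outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 8)), np.zeros(8)   # the "dials" (connection strengths)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

# Examples of what the inputs and outputs should look like: y = sin(x).
x = np.linspace(-3, 3, 64).reshape(-1, 1)
y = np.sin(x)

lr = 0.01
for step in range(20000):
    # Forward pass: just multiplication and addition (plus a squashing function).
    h = np.tanh(x @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y                      # how far off the outputs currently are

    # Backward pass: nudge every dial a little in the direction that reduces the error.
    dW2 = h.T @ err / len(x)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)
    dW1 = x.T @ dh / len(x)
    db1 = dh.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```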

Now deepfakes.

First you take some regular old boring face detection technology (the kind that’s in your camera/phone) and use it on a bunch of videos of the original person (A) and the person you want to replace their face with (B). This gives you the positions of the faces. You then use normal image processing stuff to pull out just the faces. At this point it’s a good idea to have a human check the frames and throw away the ones that the algorithm detected wrong.
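A rough sketch of that first step using OpenCV’s stock face detector. The file names, crop size, and detector settings here are just assumptions for illustration; real deepfake tools typically use better, landmark-based detectors.

```python
import os
import cv2

os.makedirs("faces_a", exist_ok=True)
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("person_a.mp4")   # hypothetical source video of Person A
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        crop = cv2.resize(frame[y:y + h, x:x + w], (128, 128))   # fixed size for training
        cv2.imwrite(f"faces_a/{frame_idx:06d}.png", crop)
    frame_idx += 1
cap.release()
```

You’d repeat the same thing for Person B’s videos, and then (as described above) eyeball the crops and delete the bad detections.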

Then you feed that into an autoencoder. An autoencoder is a type of neural network that turns an image into a smaller output, basically a small set of numbers (something like 1000 or so), then turns that *back* into an image. You train the network so that it can reproduce the original face (so input = output). The idea is that the simpler set of numbers in between eventually captures the “variation” of the face – the parts that change, like expression, eye movement, lighting, angle, etc – and the neural networks on either side learn how to interpret Person A’s face into that set of parameters (an “encoder”), and then turn them back into Person A’s face (a “decoder”). So you feed the encoder an image of Person A’s face that is smiling and looking to the left, and you get out some set of 1000 numbers that in some way or another represent “smiling, looking to the left”, which you can turn back into (something close to) the original face with the decoder.
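Here is a minimal PyTorch sketch of such an autoencoder. The layer sizes, the 1024-number bottleneck, and the random tensor standing in for a batch of Person A’s 128×128 face crops are all assumptions for illustration.

```python
import torch
import torch.nn as nn

LATENT = 1024   # the "small set of numbers" in the middle

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, LATENT),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(LATENT, 128 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32 -> 64
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 64 -> 128
        )
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 16, 16))

# Train so that decoder(encoder(face)) ≈ face, i.e. input = output.
enc, dec = Encoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
faces_a = torch.rand(8, 3, 128, 128)   # stand-in for a batch of Person A's crops
for step in range(200):
    recon = dec(enc(faces_a))
    loss = nn.functional.mse_loss(recon, faces_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```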

Now you do the same thing with Person B and a separate network. This isn’t directly useful as is: if you use a totally separate network, both networks are going to come up with different ways of “representing” an expression in that small set of numbers, so you can’t turn one face into the other – you’d just get garbage. The “language” that one network uses to say “smiling and looking to the left” might mean “angry and looking up” to the other network.

To fix that, the trick is that you train both networks at once, and you actually use the *same* network for the encoder side.

So you simultaneously train for:
– An encoder (E) that can turn Person A’s face into some set of parameters
– A decoder (D1) that can turn those parameters back into Person A’s face
– The SAME encoder (E) that can turn Person B’s face into some set of parameters
– A decoder (D2) that can turn those parameters back into Person B’s face

And that’s the magic. Now you have a neural network that can “read” a face of *either* person, and two neural networks that can “create” each of the two faces based on those read parameters.
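Continuing the hypothetical PyTorch sketch from the autoencoder above (reusing its Encoder and Decoder classes), the shared-encoder training looks roughly like this:

```python
enc = Encoder()      # E:  shared "face reader" for both people
dec_a = Decoder()    # D1: reconstructs Person A from the parameters
dec_b = Decoder()    # D2: reconstructs Person B from the parameters

opt = torch.optim.Adam(
    list(enc.parameters()) + list(dec_a.parameters()) + list(dec_b.parameters()),
    lr=1e-4)

faces_a = torch.rand(8, 3, 128, 128)   # stand-in batches of cropped faces
faces_b = torch.rand(8, 3, 128, 128)

for step in range(200):
    # The SAME encoder reads both people; each person gets their own decoder.
    loss_a = nn.functional.mse_loss(dec_a(enc(faces_a)), faces_a)
    loss_b = nn.functional.mse_loss(dec_b(enc(faces_b)), faces_b)
    loss = loss_a + loss_b
    opt.zero_grad()
    loss.backward()
    opt.step()
```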

When person A smiles and looks to the left, the encoder spits out parameters that represent that, and then the decoder for person B can turn those into an image of person B smiling and looking to the left.

So once you’re done with the training, you just run a video through face detection, then the encoder, and then the result through the decoder for the *other* person, and insert the faces back into the original video.
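In the same hypothetical sketch, the conversion step is just running the other person’s decoder on the encoder’s output:

```python
# Take a detected, cropped, resized face of Person A from a frame of the video...
with torch.no_grad():
    face_a = torch.rand(1, 3, 128, 128)   # stand-in for one real cropped frame
    params = enc(face_a)                  # "smiling, looking left" as ~1000 numbers
    fake_b = dec_b(params)                # Person B wearing that same expression

# fake_b is then resized and pasted back over the original face location
# (the box from the face-detection step), usually with some blending around the edges.
```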

Now you have a deepfake.

[Here’s](https://arstechnica.com/science/2019/12/how-i-created-a-deepfake-of-mark-zuckerberg-and-star-treks-data/) a nice Ars Technica article that goes into more details with diagrams, and [here](https://www.youtube.com/watch?v=R9OHn5ZF4Uo) is an excellent video intro to neural networks from CGP Grey. This [footnote](https://www.youtube.com/watch?v=wvWpdrfoEv0) is closer to the way the deepfake neural networks work.

Edit: a word