How AI Deepfakes Are Really Made

How AI Deepfakes Are Really Made | Hany Farid

syNN38cu3Vw • 2025-10-03

Kind: captions
Language: en
So let's talk about deepfakes, which is
this sort of sliver of all of this.
>> Yeah.
>> So deepfakes is an umbrella term for
using machine learning AI to whole cloth
create images, audio and video of things
um that have never existed or happened.
So, for example, I can go to my favorite
deepfake generator and say, "Give me an
image of Hakee in a studio doing a
podcast with Professor Hany Farid,"
>> and it actually would do a pretty good job
because you have a presence online. I
have somewhat of a presence online. It
knows what we look like and it would
generate an image that's not exactly
this, but something like that. Or I can
say, "Please," by the way, I still say
please when I ask AI for for things.
One of my students told me that this is
a good idea because when the AI
overlords come, they're going to
remember you were polite to them. Ah,
>> I actually really like this advice.
>> Wait a minute. So, I read an article.
>> Yes. It cost tens of millions of
dollars.
>> The energy ultimate. Yes. Just saying
please and thank you. I still do it by
the way. And even in my head right there
when I was asked when I was I I still in
my head say please.
>> Well, listen. I have AI connected to my
AI, right? And so my AI corrects my AI
prompts
>> to proper grammar and it's like
>> please. It puts please in there.
>> I know. And it does cost tens of
millions of dollars for that extra
token. Okay. So, I will ask it for an
image of a um of a unicorn wearing a red
clown hat um walking down the street of
Times Square and it will generate that
image. Um I can ask uh generate an audio
uh of Professor Hany Farid saying the
following, right?
>> Um I can generate a video of me saying
and doing things I never did. And you
can clearly see the power of that
technology from a creative perspective.
If you and I are having a conversation
and in post we said something we didn't
mean to, we can just fill it in with AI
now.
>> Well, here here's the thing that makes
me you just mentioned how we're only two
three years into this. So, however good
it is now, you know,
>> this is the worst it will ever be,
>> right?
>> So, if you look at the so I can tell
you, by the way, how good it is.
>> So, in addition to being trained as a
computer scientist and applied
mathematician, I've been somewhat
trained as a as a cognitive
neuroscientist. And we do perceptual
studies. So what we do is we recruit
participants. We show them images, audio
clips and video. And we tell them half
of the things you're going to look at
are real. Half of the things are AI
generated. We explain to them what AI
generated is. We give them examples of
that.
>> And for images as of last year, people
are roughly at chance at distinguishing
a real photo from an AI-generated photo.
>> So what you mean by that is if they were
just if you had a a monkey behind a
keyboard,
>> flip flipping a coin.
>> Flipping a coin.
>> Yeah. Yeah. The monkey's probably
better than you, by the way. I'm I'm
going to go off and guess. Um, so with
audio, so we play a clip of somebody
speaking like you and we play an AI
generated version. They're slightly
above chance, not like 65%.
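
To make "at chance" concrete: when half the items are real and half are fake, a participant who is purely guessing gets about half right, so the question is whether an observed score can be told apart from guessing. The following is a minimal sketch of an exact two-sided binomial test against 50%. The function name `p_value_vs_chance` and the scores out of 100 are illustrative assumptions, not figures or methods taken from the studies described.

```python
from math import comb

def p_value_vs_chance(correct: int, trials: int) -> float:
    """Exact two-sided binomial p-value against 50% guessing."""
    # Distance of the observed score from the score expected under guessing,
    # doubled so the comparison stays in whole numbers.
    observed_gap = abs(2 * correct - trials)
    # Count every outcome that is at least that far from 50%.
    extreme = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(2 * k - trials) >= observed_gap
    )
    return extreme / 2 ** trials

# Hypothetical scores out of 100 real-or-fake judgments.
for correct in (52, 65):
    print(correct, round(p_value_vs_chance(correct, 100), 4))
```

Under these assumed numbers, 52 correct out of 100 cannot be distinguished from coin flipping at conventional thresholds, while 65 out of 100 would be clearly above chance.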
>> On image at chance, at audio slightly
better than chance and video, they're a
little bit better, but all of those
trends are going towards chance. So
here's what we know. Everything in the
next 12 months, 18 months, 24 months, I
don't know what the number is,
>> it will be indistinguishable to the
average person online, right? And that
is
>> that is a weird world we're living in
because think about how much in first of
all, the vast majority of Americans now
get the the the majority of their
information from online sources and
unfortunately from social media too.
>> And that and because it is so easy to
create this content, understand all this
is is a text prompt away. I type,
"Please give me an image of this,
generate this audio, generate this
video." There are dozens of services
that will do this extremely inexpensively
or for free. And you can carpet bomb the
internet with fake images of the
conflict in uh Gaza.
>> Fake images.
>> I have seen them too. Fake images of the
flood in Texas. Fake images and video of
the fires in, name it, across the boards,
right? Fake images of people stuffing
ballot boxes. Now we have a threat to
our democracy.
>> Wow. So suddenly our sense of reality
coming back to your first very good
question is up in the air because I can
create whatever reality I want and
understand that there's sort of three
things happening here when we talk about
deepfakes. There's the creation of it.
That's what we've been talking about.
>> There's the distribution which we
democratized 20 years ago. So anybody
can
>> publish to the world and that's very
powerful and very terrifying because
there's no editorial standards on social
media. And then there's the
amplification that we have become so
polarized as a society that when you see
things that conform to your world view,
you are more than happy to click like,
reshare, and now you have creation,
distribution, amplification.
>> Wow.
>> That's the ball game,
>> right? That's the ball game for spreading
massive lies, conspiracies, and
disinformation campaigns that affect our
global health, our planet's health, our
democracy, our economy, everything.
Everything. So let's get into how these
fakes are generated. So start with
images.
>> Good. So let's start with images because
in some ways it's the easiest one, but
all of these have a similar theme. And
one of my favorite techniques for
generating images is called a generative
adversarial network or a GAN. And here's
how it works.
>> Wait a minute. Wait a minute.
Adversarial.
>> Adversarial.
>> So that means that you're fighting your
computer.
>> Two computer two computer systems are
fighting each other. And this is sort of
the genius of this technique. So here's
how it works.
>> You have two systems.
One system's job is to make an image of
a person or a landscape or whatever you
want. Yeah. And so what it does, it
starts by, this is literally true, it
just splats down a bunch of random
pixels. So I say, generate an image of a
of a person and it says, okay, here's a
bunch of so so think uh the monkeys at
the keyboard typing randomly. Let's see
if this is Shakespeare,
>> right? And then it takes that image and
it hands it to a second system and it
says, "Is this a face?" And that system
has access to millions and millions of
images that it scraped from the internet
that are faces.
>> I see.
>> And that system says, "That thing that
you generated doesn't look like these
things over here."
>> And it gives the feedback to the
generator and it says, "Nope, try again."
>> Modify some pixels. Send it back to
what's called the discriminator. Is it a
face? No. Try again.
>> And they work in this adversarial loop.
So, it's like somebody's checking your
homework.
>> But it it seems like it could get stuck
never getting to a face.
>> You would think, and that's what's
amazing about the GANs, the is that they
converge.
>> They converge.
>> And part of that is the way they they've
been trained. But that's what's the
genius of this is that the generator is
not very smart because all it's doing is
modifying pixels. And the discriminator
is actually quite simple. It's simply
saying, does this thing look like these
things? And because you pit them against
each other in this adversarial game,
this sort of amazing thing happens out
the other side.
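
The adversarial loop described above can be sketched on a toy problem. This is a hedged illustration, not the code behind any real image generator: each "image" is a single number, the real data are assumed to be centred on 4.0, the generator is a two-parameter line, and the discriminator is a one-input logistic classifier. All names (`train_toy_gan`, `a`, `b`, `w`, `c`) and all constants are assumptions chosen for illustration. It also shows one detail the conversation simplifies: the "try again" feedback is a gradient that tells the generator in which direction to change its output.

```python
import math
import random

def sigmoid(t: float) -> float:
    # Clamp the input so exp() cannot overflow for extreme values.
    t = max(-60.0, min(60.0, t))
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(steps: int = 2000, lr: float = 0.05, batch: int = 32, seed: int = 0):
    """Adversarial training on 1-D 'images' (single numbers).

    Real data ~ Normal(4.0, 0.5). Generator: x = a*z + b for random z.
    Discriminator: D(x) = sigmoid(w*x + c), its belief that x is real.
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters (starts far from the real data)
    w, c = 0.0, 0.0  # discriminator parameters (starts with no opinion)
    for _ in range(steps):
        real = [rng.gauss(4.0, 0.5) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: raise D(real) and lower D(fake).
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (1.0 - d) * x
            grad_c += 1.0 - d
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w -= d * x
            grad_c -= d
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator step: adjust a and b so the updated discriminator
        # rates the fakes as more real.
        grad_a = grad_b = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            grad_a += (1.0 - d) * w * z
            grad_b += (1.0 - d) * w
        a += lr * grad_a / batch
        b += lr * grad_b / batch
    return a, b, w, c

a, b, w, c = train_toy_gan()
print(f"generator offset b = {b:.2f} (real data is centred on 4.0)")
```

With a discriminator this simple, the generator's offset may circle around the target rather than settle on it, which is one reason the way a GAN is trained matters for whether it converges.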
>> So here's the question. On average,
how many iterations does it take? And
then how much time does that translate
to?
>> That's a great question. So typically
the time is in seconds.
>> So there's two phases. There's you train
the GANs. That's a really long process.
But then what we call inference, which
is that run this thing, it happens in
seconds. And the reason it happens in
seconds is by the way that is hundreds
of thousands of iterations but it's on a
GPU which is very powerful and very
fast. And then there's these tricks to
make it even faster. You start with
small images and then you make them
bigger over time. So there's these
tricks to make but it is literally
seconds to make that image.
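
The "start small, then make them bigger" trick can be illustrated with a resolution schedule and a basic upsampling step. This is only a sketch under stated assumptions: the conversation does not name the method (it resembles what the literature calls progressive growing), the sizes 4 and 1024 are assumed, and a real system would learn new detail at each size rather than simply repeat pixels as `upsample_nearest` does here.

```python
def resolution_schedule(start: int = 4, final: int = 1024) -> list[int]:
    """Image side lengths, doubling from `start` until `final` is reached."""
    sizes = [start]
    while sizes[-1] < final:
        sizes.append(sizes[-1] * 2)
    return sizes

def upsample_nearest(image: list[list[float]]) -> list[list[float]]:
    """Double width and height by repeating every pixel in a 2x2 block."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # second copy of the widened row
    return out

print(resolution_schedule())
print(upsample_nearest([[1, 2], [3, 4]]))
```

Working at 4x4 first means the early, coarse decisions touch 16 pixels instead of about a million, which is where the speed-up in this kind of scheme comes from.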
>> Wow.
>> And what the brilliance of that is the
two systems are competing with each
other.
>> Um and then this thing that seems like
intelligence comes out even though it's
not. If you think about those two
individual components,
>> they're pretty basic. Pretty dumb.
>> But then you have this like emergent
behavior almost. It's like you know how
to generate images of people. That's
amazing.
>> So let's have a little fun.
>> I understand good
>> that you brought me some fakes and some
real images.
>> Good
>> to put to the test.
>> Good.
>> To see if I can
>> discern the difference.
>> So before I I'm going to play for you a
couple of audios. Before I do this, let
me say I've been doing this for a long
time and I've been I'm pretty good at
it. I'm pretty good at what I do. And I
had created three audio samples. I'm
going to play them for you.
>> Wait, are you allowed to say that that
you're you're good at what you do? I'll
say that. Connie is really good. That's
right.
>> I said pretty good, by the way.
>> She's amazing.
>> But this is amaz- This is this is this is
a true story, by the way. So, I made
three audio clips for you of me talking.
And you and I have been talking for a
little while, so you now know what my
voice sounds like.
>> And uh I got off the plane and I was in
the car coming over here and I wanted to
make sure they worked. And I played all
three of them. And I couldn't tell which
one of me was real or fake. I wasn't
100% sure. Wow.
>> And I do this for a living and it's my
voice,
>> right?
>> So, okay. So, that is Okay.
>> So, wait a minute. Which AI did you use?
This was something that you created or
something generally available.
>> So, so here's the thing you have to
understand about AI. This is so readily
available. So, here's what I did. I went
to a service. It's a commercial service.
Um, I uploaded I think it was about 3
minutes of my voice.
>> I said please um uh please clone my
voice. Um and it clones my voice. And by
what I mean by that is that it learns
the patterns of my voice. What I sound
like, the intonation, my cadence, how
fast I speak, where I put the pauses,
>> and then I can simply type
>> and have it say anything I want to say.
>> And so I'm going to I'm going to read
I'm going to have you play I'm going to
listen have you listen to three
sentences.
>> Okay.
>> Um and one of them is f- I'm going to
give you a hint. One of them is fake and
two are real. Okay.
>> Okay. And let's see what you we can do.
Okay. Here we go.
>> And in fairness, this is not the best uh
speaker, but Okay.
>> Are there guardrails in our law?
>> Ah, good. Uh, so first of all, when I
went to do this this service, um, I
uploaded my voice and there's a button
that says, "Do you have permission to
use this person's voice?" And and I did
because it was my voice, but I can
upload anybody's voice and click a
button.
>> The laws are very complicated and they
actually vary state-to-state and of
course internationally. Wow.
>> So there are almost no guardrails on
grabbing people's likeness and even if
there were,
>> there's
>> you can still do it anyway.
>> There's there's no stopping this.
There's no stopping it. Okay. All right.
Number one. Oh, and by the way, the the
three, uh, this is part of a talk I gave
recently on deepfakes. So, you'll hear
a consecutive thing. Okay. Ready?
>> And if you invite me back next year,
almost certainly everything will have
changed. Uh the nature of creation of
deepfakes, the risk of deepfakes,
>> that's the deepfake right there, man.
>> Is changing.
>> Hold on. Hold on. That was good.
>> It is a fast-moving field and we have to
start thinking seriously and carefully
about the threat of misinformation.
>> Okay,
>> good. And one more. We are living
through an unprecedented time where we
are relying more and more on the
internet for information. For
information that affects our health, our
societies, our democracies, and our
economies.
>> Can I hear number one again?
>> Yep. You're a little less sure than you
were a minute ago.
>> Yeah.
>> And if you invite me back next year,
almost certainly everything will have
changed. Uh the nature of creation of
deepfakes, the risk of deepfakes, and
the detection of deepfakes is changing.
>> I think it's the first one still. I got
it right.
>> Yeah.
>> Yeah. I struggled with it, by the way.
Honestly, I couldn't remember. I'm from
the future.
>> You're the time traveler. It turns out.
>> Wow. Well, you know what? I So, I I
started my media work in audio, right?
Being a voice actor and and very quickly
I was able to pick up on music and
commercials and movies where they were
dropping in
>> uh you know, pickups. The the reason I
figured out is there's a difference in
the background noise. Like one had more
reverb than the other. Um which is how I
I I then remembered it. But you got to
admit all three of them sound like me.
>> Oh, they all do. They all sound like
you.
>> Oh, by the way, so not only can
>> Let let me tell you what has gotten me
recently is I'll get these uh social
media announcements. Oh, there's a new
song by Tupac and Eminem. And I start
listen to it and halfway in I'm like,
no, this is Yeah. But in the beginning
they it's coming from music. Yeah, it's
coming from the way. So, this is one of
my favorite videos by the way. Let me
just show this to you.
>> And if you invite me back next year,
almost certainly everything will have
changed. Uh the nature of the creation
of deepfakes, the risk of deepfakes,
that's real. Wait, wait for it.
I don't speak
and your mouth is doing it. I don't
speak Japanese.
Doesn't it sound like Indian?
>> Yes, it does.
>> I know. So, now I can do full-blown
video.
>> Any language. Any language. By the way,
here's what's really cool about this.
Here's a really cool application. I like
foreign films a lot, but I can't stand
bad lip syncing. It makes me crazy. But
you don't need it anymore.
>> You don't need it.
>> We're now going to make videos in any
language you want and it's going to be
perfect.
>> What? How did you do that? How? What?
>> This is also a commercial software. Um,
you upload a video, say that you have
permission to do it, and you say,
"Please translate this into Japanese,
Korean, Spanish, French, German,
anything you want."
>> It's amazing.
>> That is nuts. But the fact that the
mouth change to to voice the word,
>> by the way, the way this works, this is
really amazing, is you upload a video of
you talking and what it does is it takes
the audio and transcribes it. So, it
goes from audio to words
>> and then it translates from English to
Spanish and then it synthesizes a new
audio in Spanish and then it puts that
audio back into the video. Every one of
those is an AI system, by the way. And
it does that in about 3 minutes.
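
The four steps just described (transcribe, translate, synthesize, put the audio back into the video) form a simple chain in which each stage is a separate model. The sketch below shows only that flow. Every name in it (`DubbingPipeline`, `transcribe`, `translate`, `synthesize`, `lip_sync`) is hypothetical, the stand-in stages are toy lambdas rather than real models, and the commercial service mentioned in the conversation may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DubbingPipeline:
    # Each field stands in for a separate AI model; all four are hypothetical.
    transcribe: Callable[[bytes], str]         # source audio -> source text
    translate: Callable[[str, str], str]       # (text, target language) -> text
    synthesize: Callable[[str], bytes]         # translated text -> new audio
    lip_sync: Callable[[bytes, bytes], bytes]  # (video, new audio) -> new video

    def run(self, video: bytes, audio: bytes, target_lang: str) -> bytes:
        text = self.transcribe(audio)
        translated = self.translate(text, target_lang)
        new_audio = self.synthesize(translated)
        return self.lip_sync(video, new_audio)

# Toy stand-ins so the flow can be run end to end without any real models.
demo = DubbingPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text: text.encode("utf-8"),
    lip_sync=lambda video, audio: video + b"|" + audio,
)
print(demo.run(b"VIDEO", b"hello", "es"))
```

Keeping the stages as separate callables mirrors the point that every step is its own AI system, so any one of them could be swapped out without touching the others.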
>> Wow.
>> And it's amazing. So, if you wanted to
take this podcast,
>> right,
>> and distribute it in Spanish, French,
German.
>> Yeah. Yeah.
>> Upload it.
>> And I'm just hitting India, China,
Southeast Asia,
>> two and a half billion people. Done.
Done. 10 cents each. We're good to go.

Summary

# Dissecting the New Reality: Deepfakes, Voice Cloning, and the Future of AI in Society

### Executive Summary
This video takes an in-depth look at deepfakes and increasingly capable generative AI, from the technical definition to the social impact. The speaker explains how *Generative Adversarial Networks* (GANs) create content that is hard to distinguish from the real thing, and demonstrates voice cloning and automatic video translation. The discussion highlights both the large opportunity for global content distribution and the serious threat to truthful information and democracy posed by how easily disinformation spreads.

### Key Takeaways
*   **Definition of Deepfake**: The term *deepfake* covers any media (images, audio, video) created with AI and machine learning to depict things that never existed or happened.
*   **Limits of Human Perception**: Perceptual studies show that people are now roughly at chance (50/50) at telling real images from AI-generated ones and slightly better with audio (around 65%), and the trend points toward AI content being indistinguishable to the average person within the next 12 to 24 months.
*   **GAN Technology**: The AI works through two opposing systems: a *Generator* that builds an image starting from random pixels and a *Discriminator* that judges the result against millions of real images.
*   **Social Impact**: Easy access (free or cheap) and the lack of editorial standards on social media fuel the spread of fake news and conspiracies and threaten the integrity of democracy.
*   **Voice Cloning**: From only about 3 minutes of recorded speech, a commercial service can clone a person's voice very accurately, including intonation and pauses, so that even the owner of the voice struggles to tell the difference.
*   **Video Translation & Lip-Sync**: Technology can now translate video into other languages with matching lip movement (*lip-sync*), opening up global content distribution at very low cost.

---

### Detailed Breakdown

#### 1. Introduction and Definition of Deepfakes
The video opens by explaining that *deepfake* is an umbrella term for using machine learning and AI to create images, audio, or video of things that are not real. The examples given include:
*   **Visual**: An image of people (such as Hakee and Prof. Hany Farid) in a studio scene that never took place, or a unicorn in Times Square.
*   **Audio**: An AI-generated recording of Prof. Hany Farid's voice.
*   **Video**: A video of the speaker saying and doing things he never did.

There is also a humorous aside about how people interact with AI: the speaker still says "please" when prompting, on a student's advice, and the host notes that his own AI adds "please" when it corrects his prompts.

#### 2. Perception Studies and Social Impact
The speaker, trained as a computer scientist and applied mathematician with some training in cognitive neuroscience, shares findings from perceptual studies:
*   **Human Accuracy**: People are currently at chance (50/50) at telling real images from AI-generated ones. For audio, accuracy is slightly above chance (around 65%), and all of the trends are moving toward chance.
*   **Outlook**: Within the next 12 to 24 months, AI content is predicted to be indistinguishable to the average person.
*   **Threat to Democracy**: Because most Americans get most of their information from online sources, the ease of creating fake content (such as images of the conflict in Gaza, the flood in Texas, or people stuffing ballot boxes) is a direct threat. Three components make this worse:
    1.  **Creation**: Content is easy and cheap to make.
    2.  **Distribution**: Publishing has been democratized, with no editorial standards on social media.
    3.  **Amplification**: Social polarization makes people happy to like and reshare content that conforms to their worldview.

#### 3. How It Works: GANs (Generative Adversarial Networks)
This section explains the technology behind the results, GANs, which involve two systems pitted against each other:
*   **Generator**: The first system creates an image starting from random pixels (compared to monkeys at a keyboard trying to type Shakespeare).
*   **Discriminator**: The second system checks the generator's output by comparing it with millions of face images *scraped* from the internet.
*   **Adversarial Loop**: When the Discriminator says "No, this is not a face," the Generator modifies pixels and tries again. The process repeats hundreds of thousands of times on a fast GPU until the two converge, producing a highly realistic image. It is an example of simple parts producing behavior that looks complex and intelligent.

#### 4. Practical Experiment: Voice Cloning
The transcript then moves to a demonstration between the speaker (Hany Farid) and the host on voice cloning:
*   **Method**: Farid used a commercial service, uploaded about 3 minutes of his own voice, and asked it to clone his voice.
*   **Result**: The AI learned the patterns of his voice: intonation, cadence, speed, and pauses. Farid admits that even he, as a professional and the owner of the voice, was not 100% sure which clips were real and which was fake when he tested them in the car.
*   **Legal Implications**: The service asks for permission only through a single button click, which is trivial to bypass or lie about. Laws on using a person's *likeness* are complicated and vary between states and countries, which makes the practice hard to stop.

#### 5. Video Translation and Lip-Sync
The final section covers AI applied to video:
*   **New Capability**: AI can now translate a video into another language (such as Japanese, Korean, or Spanish) and change the speaker's lip movement to match the new speech. The speaker shows himself talking in a language he does not speak, with matching lip-sync.
*   **Process**: The pipeline consists of transcribing the audio, translating the text, synthesizing new audio, and putting that audio back into the video with adjusted lip movement, all automatically.
*   **Positive Potential**: The technology is useful for film (avoiding bad *lip-sync*) and for global content distribution. A podcast or video could reach billions of people in India, China, or Southeast Asia at very low cost (about 10 cents each).

---

### Conclusion & Closing Message
Deepfake technology and generative AI have reached a level that is both frightening and impressive. On one hand, they offer a remarkable opportunity to break down language barriers and distribute knowledge globally at low cost. On the other hand, the ability to create fake realities that cannot be told apart from real ones poses an existential threat to truth, public health, and democracy. Without ethical oversight and sound policy, we risk *massive lies* that are easily amplified by a polarized society.

file updated 2026-02-13 12:55:27 UTC