How AI Agents Actually Work: Building One From Scratch (No Frameworks)
zAfsz94ka7s • 2026-01-11
Welcome to the explainer. Today we are building an AI agent completely from scratch. And when I say from scratch, I mean it. We're using absolutely no frameworks here so you can see exactly what's going on under the hood. Look, our goal isn't to build the next production-ready app. It's to get our hands on the engine itself to see every single gear turn. By the end of this, you're going to understand the core logic that powers every single AI agent out there. All right, so here's our road map for today. First, we'll start with the absolute basics. What even is an AI agent? How is it different from a regular chatbot? Then we'll map out the agentic workflow. This is the fundamental communication pattern that makes everything tick. After that, we get our hands dirty. We'll set up our environment, create some custom tools for our agent, build the agent class itself, the real engine, and then we'll put it all together for a live demo. So, to really get why AI agents are such a huge deal, we've got to take a quick look back at how we used to talk to large language models not too long ago. The shift from then to now, well, it's the entire reason agents are so powerful and are literally changing how we interact with computers. This slide just nails the entire evolution. Over on the left, you've got the 2022 model, the classic Q&A. You ask a question, and the LLM digs through its training data, its internal library, and gives you an answer. But here's the catch: it was a closed system. Its knowledge was totally frozen in time. Now look at the right. That's today's agent model. It's a whole different ballgame. The LLM can now use tools. It can reach out, access real-world, up-to-the-minute info, and actually do things for you. It's a massive jump from being a passive encyclopedia to an active assistant. Okay, so this is the absolute key definition. At its heart, an AI agent is just an LLM application that can execute tools. That's it.
That's the secret sauce. The ability to call a function, ping an API, or run a database query is what separates an agent from a simple chatbot. It's what lets it break free from its training data and play with live, real-world information. You know, it's the difference between asking a dusty encyclopedia a question and asking a research assistant to go find you the latest answer from anywhere on the internet. And this is the perfect way to think about what we're doing today. We're not just going to look at the agent's final answer. That's like looking at the face of a clock. It tells you the right time, but you have no idea how. Nope. We're going to pry the back off this thing and watch every single gear, every spring, every little part move. We want to understand the mechanics of how an agent thinks, how it picks a tool, and what it does with the information it gets back. Trust me, knowing this is absolutely crucial when you start building more complex systems. Okay, let's look at the blueprint for our agent. This workflow is the absolute key to how any agent operates, from the simple one we're building today to the most complex systems you can imagine. It's this back-and-forth communication, this little dance between you, the application, and the language model, that makes all the magic happen. This slide shows the beautiful division of labor here. Let's walk through it. Step one, you ask a simple question. Step two, our application sends your question to the LLM, but, and this is so important, it also sends a list of all the tools it knows how to use. The LLM acts as the brain. It looks at your question and goes, "Aha, to answer this, I need the get_temperature tool." Now, step three is critical. The LLM doesn't run the tool. It can't. It sends a message back to our application telling it what to run. Our app is the hands. Step four, we run the function, get back a result, in this case the number 72, and we send that result right back to the LLM.
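Steps two and three boil down to two data shapes traveling over the wire. Here's a minimal sketch of those shapes (the exact field layout varies by API; this follows the common OpenAI-style chat format, and the question text is illustrative):

```python
import json

# Step 2: the application sends the user's question PLUS a list of tool
# schemas the model is allowed to request.
request = {
    "messages": [
        {"role": "user", "content": "What's the temperature in San Francisco?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Look up the current temperature for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# Step 3: the model does NOT run anything. It replies with a structured
# request telling our application which function to call and with what arguments.
reply = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "function": {
            "name": "get_temperature",
            "arguments": json.dumps({"city": "San Francisco"}),
        }
    }],
}

call = reply["tool_calls"][0]["function"]
print(call["name"], json.loads(call["arguments"]))  # get_temperature {'city': 'San Francisco'}
```

Note that `arguments` arrives as a JSON string, not a dict, so the application has to parse it before calling the function.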
And finally, step five, the LLM takes that new piece of data and crafts it into a perfect, natural-sounding sentence. All right, we've got our blueprint. It's time to start building. The very first step is to get our environment set up and make that connection to a language model. And this isn't just a boring formality. Picking the right model and understanding how we're going to talk to it is the foundation for everything else we're about to do. Okay, let's break down what's happening in the code here. We're using the huggingface_hub library, and the InferenceClient is our workhorse. Think of it as our gateway to the model. It handles all the messy stuff: formatting our requests, authenticating with our API token, and parsing the response. We just need to give it our token, which is like our password, and tell it which model we want to use. And this part is vital. The model has to support function calling or tool use. This means it's been specially trained to recognize when a tool could help answer a question and to respond with that structured request. Not all models can do this, so picking the right one is step one. Before we give our agent its superpowers, let's do a quick baseline test. We're just making a standard call to the model asking a simple question. And you can see on the right the tool_calls part of the response is None. This is the model literally telling us, hey, I looked at your question and I don't need any tools to answer it. It's just using its internal knowledge, just like the old Q&A models. This confirms our connection is working and gives us a really clear before picture. All right, now for what is in my opinion the most exciting part of the setup, actually building the components. And for an agent, the most important component by far is its tool. This is where we stop just talking and start doing. And would you look at that? This shows just how simple a tool can be. It's just a regular Python function.
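In the video the connection goes through huggingface_hub's InferenceClient (roughly `client = InferenceClient(model=..., token=...)` followed by a chat-completion call). To keep this sketch runnable without a token, a network, or the library installed, the model's baseline reply is faked below; the point is its shape, with `tool_calls` set to `None` when the model decides it needs no tools:

```python
from dataclasses import dataclass
from typing import Optional

# Real code would look roughly like this (not run here):
#   from huggingface_hub import InferenceClient
#   client = InferenceClient(model="<a tool-use-capable model>", token=HF_TOKEN)
#   response = client.chat_completion(messages=messages)

@dataclass
class AssistantMessage:
    content: Optional[str]
    tool_calls: Optional[list]  # None => answered from internal knowledge alone

# Baseline test: a plain question, no tools offered.
messages = [{"role": "user", "content": "Who wrote 'The Old Man and the Sea'?"}]

# Simulated reply for a question the model can answer from its training data.
response = AssistantMessage(content="Ernest Hemingway wrote it.", tool_calls=None)

if response.tool_calls is None:
    print("No tools needed:", response.content)
```

That `tool_calls is None` check is the "before picture": the same check, answered the other way, is what will drive the agent loop later.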
Now, this is a fake one obviously, but just imagine the possibilities inside this function. You could be calling a real-time weather API. You could be connecting to a database to run a query. You could use the Gmail API to search your inbox or the Google Calendar API to create an event. Seriously, anything you can program in Python can be wrapped up like this and turned into a powerful tool for your agent. So, we have this cool Python function, right? But how in the world does the LLM, which only understands text, know that our function even exists, let alone how to use it? The answer is a tool schema. Think of it like an instruction manual for the function, written in a language the LLM can perfectly understand. It's a chunk of JSON that describes everything: the exact function name, a clear description of what it does, and exactly what kind of arguments it needs to work. Now, you could absolutely write this schema by hand as a big JSON string. But trust me, you do not want to do that. It is tedious. It's long, and it's so easy to make one tiny typo that breaks the whole thing. A much, much better way is to use a library like Pydantic. It lets you define your tool's arguments in clean, readable Python, and then it generates the perfectly formatted, error-free JSON for you. It's just the professional way to do it. And just look at how clean this code is. We define a simple class. We declare our arguments and their types. city is a string. And then we add a description. Now, this description is incredibly important. It's not a comment for you or other developers. This is the exact text the LLM will read to figure out what kind of information to put in that city field. A good, clear description is the key to getting the model to use your tool correctly. And then boom, one line of code at the bottom and Pydantic does all the heavy lifting for us. Okay, we have our blueprint, the workflow, we have our main components, the tool, and its instruction manual, the schema.
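Here's a minimal version of that tool plus its instruction manual. The video generates the arguments schema with Pydantic (define a model with a described `city` field, then call `.model_json_schema()` in Pydantic v2); to keep this sketch dependency-free, the equivalent JSON is written out by hand. The 72 is the demo's fake reading, not a real API call:

```python
# A tool is just a regular Python function. A real one might call a
# weather API; this fake one returns a fixed reading, as in the demo.
def get_temperature(city: str) -> int:
    return 72  # pretend we queried a live weather service

# The schema is the tool's "instruction manual" for the LLM. With Pydantic
# you would declare the arguments as a model and generate the "parameters"
# block instead of hand-writing it.
get_temperature_schema = {
    "type": "function",
    "function": {
        "name": "get_temperature",
        "description": "Get the current temperature in Fahrenheit for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    # The LLM reads this text to fill the field correctly.
                    "description": "Name of the city to look up, e.g. 'San Francisco'.",
                }
            },
            "required": ["city"],
        },
    },
}

print(get_temperature("San Francisco"))  # 72
```

The function and the schema travel separately: the schema goes to the LLM as text, while the function stays in our application, waiting to be dispatched by name.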
Now, it's time to assemble the engine. We're going to build a Python class that will orchestrate this entire dance. It's going to manage the conversation, call the LLM, and execute the tools. This is where all that logic comes to life. The heart and soul of our agent is a loop. It's a continuous cycle. Step one, we send the whole conversation so far, plus our list of tools, to the LLM. Step two, we look at what it sends back, and we only care about one question. Did it ask to use a tool? If the answer is yes, we run the tool, add the result to our conversation history, and immediately go right back to step one, sending the newly updated history. If the answer is no, that means the LLM is done thinking. It has everything it needs. So, we grab its final text response and we break out of the loop. And here's that exact logic in Python. We've got a while True loop that will just keep running. Inside, we call the model and then we check. Does the response have tool_calls? If it does, we do our work and the loop continues. If not, we have our final answer. So, we return it and break the loop. This whole structure, the while loop, the if-else check, the message history management. This is what we call boilerplate code. It's the stuff you have to write every single time. And it's exactly what frameworks like LangChain or smolagents are designed to handle for you. So what's happening inside that if block? This is where our application puts on its work gloves and acts as the hands. We get the name of the function the LLM wants to run and the arguments it provided. We then find our actual Python function that matches that name. We run it with those arguments and we get the output. Then we package that output into a special tool message and tack it on to the end of our conversation history. This is the step that closes the loop and gives the LLM the real-world info it asked for. The engine is assembled. All the components are in place. All the logic is written.
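The loop and the "hands" step together, as a runnable sketch. The LLM here is a stub standing in for the real chat-completion call, scripted to request the tool once and then answer; everything else, the while loop, the tool dispatch by name, the history management, is the boilerplate described above:

```python
import json

def get_temperature(city: str) -> int:
    return 72  # fake tool, as in the demo

TOOLS = {"get_temperature": get_temperature}  # name -> function lookup table

def fake_llm(messages):
    """Stub for the real model call. First turn: request the tool.
    Once a tool result is in the history: write the final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": None,
                "tool_calls": [{"function": {
                    "name": "get_temperature",
                    "arguments": json.dumps({"city": "San Francisco"})}}]}
    result = next(m["content"] for m in messages if m["role"] == "tool")
    return {"role": "assistant", "tool_calls": None,
            "content": f"It's currently {result}F in San Francisco."}

def run_agent(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:                              # the agent's heartbeat
        reply = fake_llm(messages)           # send the whole history to the LLM
        messages.append(reply)
        if reply["tool_calls"]:              # did it ask to use a tool?
            for call in reply["tool_calls"]:
                fn = TOOLS[call["function"]["name"]]        # find our function
                args = json.loads(call["function"]["arguments"])
                output = fn(**args)                          # run it: the "hands"
                messages.append({"role": "tool", "content": str(output)})
            continue                         # back to step one with updated history
        return reply["content"]              # no tool call => final answer

print(run_agent("What's the temperature in San Francisco?"))
```

Swapping `fake_llm` for a real chat-completion call (passing the tool schemas along) is the only change needed to make this a live agent; the loop itself stays identical.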
It is time for the moment of truth. Let's fire this thing up and see our brand new from-scratch agent in action. This is where all the theory gets real. All right, to kick things off, we'll create our agent, give it the get_temperature tool, and then we'll give it this nice, easy pitch right over the plate. Let's see if our blueprint actually works. And bingo. Look at that. This is a log message from inside our agent's brain, the behind-the-scenes view. Our agent got the prompt, the LLM correctly decided to call our tool, and our agent correctly parsed the city as San Francisco and ran our function. It got the output, 72. Every single step of the blueprint worked like a charm. So after the tool ran, our agent sent the result, that number 72, back to the LLM. The LLM then took that new information and generated this beautiful, human-readable sentence. This is the final output that the user actually sees. It's the face of the clock showing the right time, and it's powered by all that cool machinery we just built. Okay, now let's pull back the curtain one last time. This table, this is the agent's complete internal memory from that one single question. This is its entire thought process. It starts with a system prompt, then our user message. Then notice the assistant's first reply isn't text, it's a tool call. Our app then adds the tool message with the result 72. And only then, with all the facts in hand, does the assistant give the final text answer. Our five-step workflow is laid out right here, plain as day. And there you have it. We did it. We built a simple but fully functional AI agent completely from scratch. And doing this is so powerful because it completely demystifies the whole process. You now understand the fundamental logic that's humming away inside even the most complicated agentic systems out there. Now, of course, in the real world, you're not going to write all this boilerplate code every time.
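That memory table boils down to a five-entry message list. Here it is as data, with the tool-call details abbreviated and the exact wording illustrative, so the thought process is visible at a glance. Notice the assistant speaks twice, once to request the tool and once to answer:

```python
memory = [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "What's the temperature in San Francisco?"},
    # The assistant's first turn is not text -- it is a structured tool request.
    {"role": "assistant", "content": None,
     "tool_calls": [{"function": {"name": "get_temperature",
                                  "arguments": '{"city": "San Francisco"}'}}]},
    # Our application runs the tool and appends its result.
    {"role": "tool",      "content": "72"},
    # Only now, with the fact in hand, does the assistant produce text.
    {"role": "assistant", "content": "It's currently 72F in San Francisco."},
]

print([m["role"] for m in memory])
# ['system', 'user', 'assistant', 'tool', 'assistant']
```

This list is the agent's entire memory: the final answer the user sees is just the last entry, and everything above it is the machinery that produced it.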
You'd use a framework, something like smolagents from Hugging Face. They handle all the boring stuff, the looping, the history, the schemas, so you can just focus on building awesome tools. If you want to see how that's done, make sure you subscribe because we'll definitely be covering that in a future explainer. So, let's recap the deep dive. Agents are just LLMs with tools. They run on a simple loop. The LLM thinks, and our app acts. Schemas are the critical instruction manuals for those tools. The agent's real memory is a full transcript of this entire thought process. And most importantly, because we built this from scratch, you now have a rock-solid foundation for building, and more importantly debugging, with the big production frameworks. We've seen the blueprint. We've assembled the engine. This isn't just theory anymore. It's a practical foundation. So, the question I want to leave you with is, what's the first tool you would build? A tool to organize your calendar? A tool to summarize your unread emails? The possibilities are literally endless. Thanks for watching the explainer, and don't forget to subscribe for more deep dives just like this one.