How AI Agents Actually Work: Building One From Scratch (No Frameworks)
zAfsz94ka7s • 2026-01-11
Welcome to the explainer. Today we are building an AI agent completely from scratch. And when I say from scratch, I mean it. We're using absolutely no frameworks here so you can see exactly what's going on under the hood. Look, our goal isn't to build the next production-ready app. It's to get our hands on the engine itself to see every single gear turn. By the end of this, you're going to understand the core logic that powers every single AI agent out there. All right, so here's our road map for today. First, we'll start with the absolute basics. What even is an AI agent? How is it different from a regular chatbot? Then we'll map out the agentic workflow. This is the fundamental communication pattern that makes everything tick. After that, we get our hands dirty. We'll set up our environment, create some custom tools for our agent, build the agent class itself, the real engine, and then we'll put it all together for a live demo. So, to really get why AI agents are such a huge deal, we've got to take a quick look back at how we used to talk to large language models not too long ago. The shift from then to now, well, it's the entire reason agents are so powerful and are literally changing how we interact with computers. This slide just nails the entire evolution. Over on the left, you've got the 2022 model, the classic Q&A. You ask a question, and the LLM digs through its training data, its internal library, and gives you an answer. But here's the catch: it was a closed system. Its knowledge was totally frozen in time. Now look at the right. That's today's agent model. It's a whole different ballgame. The LLM can now use tools. It can reach out, access real-world, up-to-the-minute info, and actually do things for you. It's a massive jump from being a passive encyclopedia to an active assistant. Okay, so this is the absolute key definition. At its heart, an AI agent is just an LLM application that can execute tools. That's it.
That's the secret sauce. The ability to call a function, ping an API, or run a database query is what separates an agent from a simple chatbot. It's what lets it break free from its training data and play with live, real-world information. You know, it's the difference between asking a dusty encyclopedia a question and asking a research assistant to go find you the latest answer from anywhere on the internet. And this is the perfect way to think about what we're doing today. We're not just going to look at the agent's final answer. That's like looking at the face of a clock. It tells you the right time, but you have no idea how. Nope. We're going to pry the back off this thing and watch every single gear, every spring, every little part move. We want to understand the mechanics of how an agent thinks, how it picks a tool, and what it does with the information it gets back. Trust me, knowing this is absolutely crucial when you start building more complex systems. Okay, let's look at the blueprint for our agent. This workflow is the absolute key to how any agent operates, from the simple one we're building today to the most complex systems you can imagine. It's this back-and-forth communication, this little dance between you, the application, and the language model, that makes all the magic happen. This slide shows the beautiful division of labor here. Let's walk through it. Step one, you ask a simple question. Step two, our application sends your question to the LLM, but, and this is so important, it also sends a list of all the tools it knows how to use. The LLM acts as the brain. It looks at your question and goes, "Aha, to answer this, I need the get_temperature tool." Now, step three is critical. The LLM doesn't run the tool. It can't. It sends a message back to our application telling it what to run. Our app is the hands. Step four, we run the function, get back a result, in this case the number 72, and we send that result right back to the LLM.
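Steps two and three boil down to two data shapes traveling over the wire. Here's a minimal sketch of those shapes (the exact field layout varies by API; this follows the common OpenAI-style chat format, and the question text is illustrative):

```python
import json

# Step 2: the application sends the user's question PLUS a list of tool
# schemas the model is allowed to request.
request = {
    "messages": [
        {"role": "user", "content": "What's the temperature in San Francisco?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Look up the current temperature for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# Step 3: the model does NOT run anything. It replies with a structured
# request telling our application which function to call and with what arguments.
reply = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "function": {
            "name": "get_temperature",
            "arguments": json.dumps({"city": "San Francisco"}),
        }
    }],
}

call = reply["tool_calls"][0]["function"]
print(call["name"], json.loads(call["arguments"]))  # get_temperature {'city': 'San Francisco'}
```

Note that `arguments` arrives as a JSON string, not a dict, so the application has to parse it before calling the function.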
And finally, step five, the LLM takes that new piece of data and crafts it into a perfect, natural-sounding sentence. All right, we've got our blueprint. It's time to start building. The very first step is to get our environment set up and make that connection to a language model. And this isn't just a boring formality. Picking the right model and understanding how we're going to talk to it is the foundation for everything else we're about to do. Okay, let's break down what's happening in the code here. We're using the huggingface_hub library, and the InferenceClient is our workhorse. Think of it as our gateway to the model. It handles all the messy stuff: formatting our requests, authenticating with our API token, and parsing the response. We just need to give it our token, which is like our password, and tell it which model we want to use. And this part is vital. The model has to support function calling or tool use. This means it's been specially trained to recognize when a tool could help answer a question and to respond with that structured request. Not all models can do this, so picking the right one is step one. Before we give our agent its superpowers, let's do a quick baseline test. We're just making a standard call to the model asking a simple question. And you can see on the right the tool_calls part of the response is None. This is the model literally telling us, hey, I looked at your question and I don't need any tools to answer it. It's just using its internal knowledge, just like the old Q&A models. This confirms our connection is working and gives us a really clear before picture. All right, now for what is in my opinion the most exciting part of the setup, actually building the components. And for an agent, the most important component by far is its tool. This is where we stop just talking and start doing. And would you look at that? This shows just how simple a tool can be. It's just a regular Python function.
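In the video the connection goes through huggingface_hub's InferenceClient (roughly `client = InferenceClient(model=..., token=...)` followed by a chat-completion call). To keep this sketch runnable without a token, a network, or the library installed, the model's baseline reply is faked below; the point is its shape, with `tool_calls` set to `None` when the model decides it needs no tools:

```python
from dataclasses import dataclass
from typing import Optional

# Real code would look roughly like this (not run here):
#   from huggingface_hub import InferenceClient
#   client = InferenceClient(model="<a tool-use-capable model>", token=HF_TOKEN)
#   response = client.chat_completion(messages=messages)

@dataclass
class AssistantMessage:
    content: Optional[str]
    tool_calls: Optional[list]  # None => answered from internal knowledge alone

# Baseline test: a plain question, no tools offered.
messages = [{"role": "user", "content": "Who wrote 'The Old Man and the Sea'?"}]

# Simulated reply for a question the model can answer from its training data.
response = AssistantMessage(content="Ernest Hemingway wrote it.", tool_calls=None)

if response.tool_calls is None:
    print("No tools needed:", response.content)
```

That `tool_calls is None` check is the "before picture": the same check, answered the other way, is what will drive the agent loop later.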
Now, this is a fake one obviously, but just imagine the possibilities inside this function. You could be calling a real-time weather API. You could be connecting to a database to run a query. You could use the Gmail API to search your inbox or the Google Calendar API to create an event. Seriously, anything you can program in Python can be wrapped up like this and turned into a powerful tool for your agent. So, we have this cool Python function, right? But how in the world does the LLM, which only understands text, know that our function even exists, let alone how to use it? The answer is a tool schema. Think of it like an instruction manual for the function, written in a language the LLM can perfectly understand. It's a chunk of JSON that describes everything: the exact function name, a clear description of what it does, and exactly what kind of arguments it needs to work. Now, you could absolutely write this schema by hand as a big JSON string. But trust me, you do not want to do that. It is tedious. It's long, and it's so easy to make one tiny typo that breaks the whole thing. A much, much better way is to use a library like Pydantic. It lets you define your tool's arguments in clean, readable Python, and then it generates the perfectly formatted, error-free JSON for you. It's just the professional way to do it. And just look at how clean this code is. We define a simple class. We declare our arguments and their types. city is a string. And then we add a description. Now, this description is incredibly important. It's not a comment for you or other developers. This is the exact text the LLM will read to figure out what kind of information to put in that city field. A good, clear description is the key to getting the model to use your tool correctly. And then boom, one line of code at the bottom and Pydantic does all the heavy lifting for us. Okay, we have our blueprint, the workflow, we have our main components, the tool, and its instruction manual, the schema.
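Here's a minimal version of that tool plus its instruction manual. The video generates the arguments schema with Pydantic (define a model with a described `city` field, then call `.model_json_schema()` in Pydantic v2); to keep this sketch dependency-free, the equivalent JSON is written out by hand. The 72 is the demo's fake reading, not a real API call:

```python
# A tool is just a regular Python function. A real one might call a
# weather API; this fake one returns a fixed reading, as in the demo.
def get_temperature(city: str) -> int:
    return 72  # pretend we queried a live weather service

# The schema is the tool's "instruction manual" for the LLM. With Pydantic
# you would declare the arguments as a model and generate the "parameters"
# block instead of hand-writing it.
get_temperature_schema = {
    "type": "function",
    "function": {
        "name": "get_temperature",
        "description": "Get the current temperature in Fahrenheit for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    # The LLM reads this text to fill the field correctly.
                    "description": "Name of the city to look up, e.g. 'San Francisco'.",
                }
            },
            "required": ["city"],
        },
    },
}

print(get_temperature("San Francisco"))  # 72
```

The function and the schema travel separately: the schema goes to the LLM as text, while the function stays in our application, waiting to be dispatched by name.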
Now, it's time to assemble the engine. We're going to build a Python class that will orchestrate this entire dance. It's going to manage the conversation, call the LLM, and execute the tools. This is where all that logic comes to life. The heart and soul of our agent is a loop. It's a continuous cycle. Step one, we send the whole conversation so far, plus our list of tools, to the LLM. Step two, we look at what it sends back, and we only care about one question. Did it ask to use a tool? If the answer is yes, we run the tool, add the result to our conversation history, and immediately go right back to step one, sending the newly updated history. If the answer is no, that means the LLM is done thinking. It has everything it needs. So, we grab its final text response and we break out of the loop. And here's that exact logic in Python. We've got a while True loop that will just keep running. Inside, we call the model and then we check. Does the response have tool_calls? If it does, we do our work and the loop continues. If not, we have our final answer. So, we return it and break the loop. This whole structure, the while loop, the if-else check, the message history management. This is what we call boilerplate code. It's the stuff you have to write every single time. And it's exactly what frameworks like LangChain or smolagents are designed to handle for you. So what's happening inside that if block? This is where our application puts on its work gloves and acts as the hands. We get the name of the function the LLM wants to run and the arguments it provided. We then find our actual Python function that matches that name. We run it with those arguments and we get the output. Then we package that output into a special tool message and tack it on to the end of our conversation history. This is the step that closes the loop and gives the LLM the real-world info it asked for. The engine is assembled. All the components are in place. All the logic is written.
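The loop and the "hands" step together, as a runnable sketch. The LLM here is a stub standing in for the real chat-completion call, scripted to request the tool once and then answer; everything else, the while loop, the tool dispatch by name, the history management, is the boilerplate described above:

```python
import json

def get_temperature(city: str) -> int:
    return 72  # fake tool, as in the demo

TOOLS = {"get_temperature": get_temperature}  # name -> function lookup table

def fake_llm(messages):
    """Stub for the real model call. First turn: request the tool.
    Once a tool result is in the history: write the final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": None,
                "tool_calls": [{"function": {
                    "name": "get_temperature",
                    "arguments": json.dumps({"city": "San Francisco"})}}]}
    result = next(m["content"] for m in messages if m["role"] == "tool")
    return {"role": "assistant", "tool_calls": None,
            "content": f"It's currently {result}F in San Francisco."}

def run_agent(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:                              # the agent's heartbeat
        reply = fake_llm(messages)           # send the whole history to the LLM
        messages.append(reply)
        if reply["tool_calls"]:              # did it ask to use a tool?
            for call in reply["tool_calls"]:
                fn = TOOLS[call["function"]["name"]]        # find our function
                args = json.loads(call["function"]["arguments"])
                output = fn(**args)                          # run it: the "hands"
                messages.append({"role": "tool", "content": str(output)})
            continue                         # back to step one with updated history
        return reply["content"]              # no tool call => final answer

print(run_agent("What's the temperature in San Francisco?"))
```

Swapping `fake_llm` for a real chat-completion call (passing the tool schemas along) is the only change needed to make this a live agent; the loop itself stays identical.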
It is time for the moment of truth. Let's fire this thing up and see our brand new from-scratch agent in action. This is where all the theory gets real. All right, to kick things off, we'll create our agent, give it the get_temperature tool, and then we'll give it this nice, easy pitch right over the plate. Let's see if our blueprint actually works. And bingo. Look at that. This is a log message from inside our agent's brain, the behind-the-scenes view. Our agent got the prompt, the LLM correctly decided to call our tool, and our agent correctly parsed the city as San Francisco and ran our function. It got the output, 72. Every single step of the blueprint worked like a charm. So after the tool ran, our agent sent the result, that number 72, back to the LLM. The LLM then took that new information and generated this beautiful, human-readable sentence. This is the final output that the user actually sees. It's the face of the clock showing the right time, and it's powered by all that cool machinery we just built. Okay, now let's pull back the curtain one last time. This table, this is the agent's complete internal memory from that one single question. This is its entire thought process. It starts with a system prompt, then our user message. Then notice the assistant's first reply isn't text, it's a tool call. Our app then adds the tool message with the result 72. And only then, with all the facts in hand, does the assistant give the final text answer. Our five-step workflow is laid out right here, plain as day. And there you have it. We did it. We built a simple but fully functional AI agent completely from scratch. And doing this is so powerful because it completely demystifies the whole process. You now understand the fundamental logic that's humming away inside even the most complicated agentic systems out there. Now, of course, in the real world, you're not going to write all this boilerplate code every time.
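That memory table boils down to a five-entry message list. Here it is as data, with the tool-call details abbreviated and the exact wording illustrative, so the thought process is visible at a glance. Notice the assistant speaks twice, once to request the tool and once to answer:

```python
memory = [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "What's the temperature in San Francisco?"},
    # The assistant's first turn is not text -- it is a structured tool request.
    {"role": "assistant", "content": None,
     "tool_calls": [{"function": {"name": "get_temperature",
                                  "arguments": '{"city": "San Francisco"}'}}]},
    # Our application runs the tool and appends its result.
    {"role": "tool",      "content": "72"},
    # Only now, with the fact in hand, does the assistant produce text.
    {"role": "assistant", "content": "It's currently 72F in San Francisco."},
]

print([m["role"] for m in memory])
# ['system', 'user', 'assistant', 'tool', 'assistant']
```

This list is the agent's entire memory: the final answer the user sees is just the last entry, and everything above it is the machinery that produced it.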
You'd use a framework, something like smolagents from Hugging Face. They handle all the boring stuff, the looping, the history, the schemas, so you can just focus on building awesome tools. If you want to see how that's done, make sure you subscribe because we'll definitely be covering that in a future explainer. So, let's recap the deep dive. Agents are just LLMs with tools. They run on a simple loop. The LLM thinks, and our app acts. Schemas are the critical instruction manuals for those tools. The agent's real memory is a full transcript of this entire thought process. And most importantly, because we built this from scratch, you now have a rock-solid foundation for building, and more importantly debugging, with the big production frameworks. We've seen the blueprint. We've assembled the engine. This isn't just theory anymore. It's a practical foundation. So, the question I want to leave you with is, what's the first tool you would build? A tool to organize your calendar? A tool to summarize your unread emails? The possibilities are literally endless. Thanks for watching the explainer, and don't forget to subscribe for more deep dives just like this one.