Transcript
Gz3UcCENYsg • Steering LLMs: How to Change AI Personality Without Fine-Tuning
Kind: captions Language: en

So, you want to change an AI's personality, right? You'd probably think of two main ways: you either get really clever with your prompts, or you do some expensive, time-consuming fine-tuning. But what if there was a third way? A way to directly reach into the model's thoughts while they're actually happening. Well, today we're diving into this incredible technique for steering large language models, which lets us go way beyond the prompt and directly influence an AI's line of thought.

All right, so check this out. You ask a standard Llama 3.1 8B model, "Who are you?", and this is what you get. It's exactly what you'd expect, right? It's helpful, it's accurate, it's totally standard. But hold that thought, because what if I told you that this response came from the exact same Llama 3.1 model? No fine-tuning, no special system prompt, nothing. The model suddenly, genuinely seems to believe it's a large metal structure. How is that even possible?

Yeah, that's the big question, right? And it's exactly what we're going to answer. We are going to unpack how you can fundamentally change a model's behavior at inference time, I mean, while it's literally in the middle of generating a response, without touching a single one of its trained weights.

Okay, so here's our game plan for figuring this whole thing out. First, we're going to borrow a really cool idea from neuroscience to make this all feel more intuitive. Then we'll peek inside the LLM's brain to see how it thinks about concepts. After that, we'll get our hands dirty, see steering in practice, and figure out where you even find these steering vectors. And finally, we'll get real about the incredible power and the limits of this technique.

All right, let's dive in. And to really get a grip on this, we're actually going to start somewhere you might not expect: neuroscience. See, the whole idea of model steering is kind of like a real technique called neurostimulation. Neuroscientists can actually use tiny electrodes or even magnetic fields to, well, to nudge specific parts of the brain. They can trigger a movement, bring up an emotion, or even spark a memory. It's used all the time in research to figure out how the brain works, and even in medicine to treat things like Parkinson's disease. And here is the key parallel for us: neurostimulation is something that happens on the fly. It doesn't permanently change the brain's wiring. And that is a perfect analogy for what we're about to do to an LLM. We're going to intervene in its thinking process without permanently changing its structure at all.

So with that brain analogy fresh in our minds, let's make the jump over to the AI itself. How do these artificial neural networks actually represent abstract ideas? And more importantly, how can we mess with those representations? You know, inside a transformer model like Llama, information gets processed through a stack of layers. And as the data moves from one layer to the next, it's represented by this huge list of numbers, what we call a vector. This vector can have thousands of dimensions, and it's basically the model's internal state, its hidden thought at that precise moment.

Okay, now this is where the magic happens. Researchers found this incredible thing they call the linear representation phenomenon. It turns out that LLMs naturally, on their own, learn to represent concepts we can understand, you know, like love or royalty or even the Eiffel Tower, as specific directions, or vectors, inside that massive hidden space.

You've probably seen this classic example before, right? It showed that you could literally do math with word meanings. You take the vector for king, subtract man, add woman, and you end up right at the vector for queen. It's wild, and it shows that concepts aren't just random points. They exist in this logical, structured way. The same exact principle is at play inside the deeper layers of a modern LLM. So, let's boil this down: the direction of the vector is the concept; the length, or magnitude, of that vector is the concept's intensity. And since they're just vectors, we can add and subtract them. Now, it's also really important to know that the middle layers of the model are usually the sweet spot for this stuff. The early layers are just dealing with basic grammar, and the final layers are just trying to format the output. The middle, that's where the abstract thinking is really happening.
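To make that word-math concrete, here's a minimal sketch using the open-source gensim library and its downloadable word2vec-google-news-300 vectors. Both choices are ours, not the video's; any pretrained word embeddings behave the same way:

```python
# A concrete check of "king - man + woman ~ queen" with pretrained
# word2vec vectors via gensim (roughly a 1.6 GB download on first run).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # 300-dimensional word vectors

# most_similar does the vector arithmetic and returns the nearest neighbors
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" should come out on top
```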
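And here's a minimal sketch of what that hidden "thought" vector looks like in code, using Hugging Face Transformers. The checkpoint name and layer index are illustrative assumptions (Llama 3.1 is a gated download; any causal LM exposes the same hidden_states output):

```python
# Peek at the model's internal state ("thought" vector) at a middle layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"            # illustrative; a gated checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the embeddings plus one tensor per layer, each of
# shape (batch, seq_len, hidden_dim); for Llama 3.1 8B, hidden_dim is 4096.
thought = out.hidden_states[15][0, -1]      # last token's vector after layer 15
print(thought.shape)                        # torch.Size([4096])
```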
Okay, theory time is over. That's all super interesting, but how do we actually do this? How do we turn this idea of concept vectors into our own little neurostimulator for an AI? Let's see what it looks like in practice. You're going to love this. The actual operation is shockingly simple. As the model is thinking, you take its current activation vector and you just add your concept vector to it, scaled by a coefficient. That's it. The coefficient here is just a number you pick to control how strong the effect is. Think of it like a volume dial for the concept of, say, the Eiffel Tower.

And putting this into practice with a library like Hugging Face Transformers is literally just a few steps. You load your model, you load your vector, and then you create this little function called a hook. You attach that hook to a specific layer, like layer 15, and its only job is to add your steering vector to the activations every single time they pass through. Then you just generate text like normal. (There's a sketch of exactly this recipe below.)

So let's see what this actually looks like. We ask the base Llama model for business ideas, and we get normal stuff: e-commerce, services. Now we apply our Eiffel Tower vector, but with a pretty low coefficient, just 4.0. And look, the model's ideas subtly shift to food and bakeries. It's not screaming Paris, but you can feel that the perspective has been nudged. And now the punch line. We crank that dial way up to a coefficient of 8.0 and ask, "Who are you?" again. And there it is. The model completely takes on the new persona. What's so amazing is that the original response started with "I'm a large language model," but the steered one starts with "I'm a large metal structure." The steering literally changed the model's mind mid-thought.

Okay, this is just incredible, right? But it leads to a huge question. Where in the world do you get these magical steering vectors from? I mean, how did anyone find the exact direction for Eiffel Tower inside the model's hidden space? Well, there are basically two main ways to do it. The first is called contrastive activation. That's where you show the model a bunch of text that has the concept you want and a bunch of text that doesn't. You then find the difference in the model's internal activations and, boom, that's your vector (the second sketch below walks through this). The other, newer method uses something called sparse autoencoders. You can think of this as a special tool that can sift through all the model's messy internal thoughts and automatically pull out a whole library of individual concepts (the last sketch below shows how you'd grab one of those directions).
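Here's a minimal sketch of that hook-and-add recipe, reusing the model and tok objects from the hidden-state sketch above. The layer index, the eiffel_tower.pt filename, and the coefficient are illustrative; the vector itself comes out of the contrastive sketch that follows:

```python
# Steer generation by adding a concept vector to the layer-15 activations.
import torch

steering_vector = torch.load("eiffel_tower.pt")   # shape: (hidden_dim,)
coefficient = 8.0                                 # the "volume dial"

def steering_hook(module, args, output):
    # Llama decoder layers usually return a tuple whose first element is
    # the hidden states; some versions return the tensor directly.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coefficient * steering_vector.to(hidden.device, hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

# Attach the hook to layer 15; it fires on every forward pass through it.
handle = model.model.layers[15].register_forward_hook(steering_hook)

inputs = tok("Who are you?", return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=60)
print(tok.decode(ids[0], skip_special_tokens=True))

handle.remove()   # detach the hook to get the unsteered model back
```

Per the demo above, a coefficient around 4.0 only nudges the outputs while 8.0 flips the whole persona, so that one number is the first knob to tune.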
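And a minimal sketch of the contrastive-activation recipe, again reusing model and tok. The matched prompt pairs are toy examples we made up, and the layer matches the one we steer:

```python
# Build a steering vector as the mean activation difference between
# prompts that contain the concept and closely matched prompts that don't.
import torch

LAYER = 15
with_concept = [
    "The Eiffel Tower rises over the Paris skyline.",
    "We watched the Eiffel Tower sparkle at midnight.",
]
without_concept = [
    "The old clock tower rises over the city skyline.",
    "We watched the harbor lights sparkle at midnight.",
]

def mean_activation(prompts):
    # Average the last-token hidden state at LAYER across the prompts.
    acts = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1].float())
    return torch.stack(acts).mean(dim=0)

# The difference of the two means points in the "Eiffel Tower" direction.
steering_vector = mean_activation(with_concept) - mean_activation(without_concept)
torch.save(steering_vector, "eiffel_tower.pt")
```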
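For the sparse-autoencoder route, the community SAELens library ships pretrained SAEs you can load in a couple of lines. This is a sketch under heavy assumptions: the release name, hook point, and feature index are GPT-2-small placeholders drawn from SAELens's own examples, and the API may differ across versions:

```python
# Pull a single concept direction out of a pretrained sparse autoencoder.
# Release / hook-point / feature-index values are placeholders you would
# look up in a feature browser; the SAELens API may vary by version.
from sae_lens import SAE

sae, cfg, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",           # a set of SAEs for GPT-2 small
    sae_id="blocks.8.hook_resid_pre",      # residual stream before block 8
)

feature_idx = 10138                        # placeholder: some concept feature
steering_vector = sae.W_dec[feature_idx]   # a decoder row is that feature's direction
```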
And here's the best part. You don't have to do all this heavy lifting yourself. There are tools like Neuronpedia, where you can literally just browse through concepts that people have already discovered and visualized. And over on the Hugging Face Hub, the community is sharing pretrained sparse autoencoders and steering vectors that you can just download and start playing with right away.

So, it's pretty clear this is a seriously powerful technique. But like any tool, it's not magic. It's really important to understand what steering is great at and what it, well, what it can't do. On the plus side, the benefits are huge. You don't need to do any expensive fine-tuning. It works instantly. You can dial the intensity up or down with just one number, and the effect holds up really well. But then you have the cons. Finding that perfect coefficient can be tricky. If you turn it up too high, the model can just start spouting nonsense. And here's the big one: steering cannot teach the model new information. If the model has never learned about a concept, you can't just invent a vector for it. You can only amplify what's already in there.

And all of this leaves us with a really big, kind of mind-bending question for the future. We're moving past just talking to AIs with prompts. We are now building the tools to perform a kind of micro-surgery on their internal ideas. As this tech gets better and better, what is that going to mean for how we control and align the truly powerful AI systems that are coming?