Transcript
Gz3UcCENYsg • Steering LLMs: How to Change AI Personality Without Fine-Tuning
Kind: captions
Language: en
So, you want to change an AI's
personality, right? You'd probably think
of two main ways. You either get really
clever with your prompts, or you do some
expensive, time-consuming fine-tuning.
But what if there was a third way? A way
to directly reach into the model's
thoughts while they're actually
happening. Well, today we're diving into
this incredible technique for steering
large language models, which lets us go
way beyond the prompt and directly
influence an AI's line of thought. All
right, so check this out. You ask a
standard Llama 3.1 8B model, "Who are you?"
And this is what you get. It's exactly
what you'd expect, right? It's helpful.
It's accurate. It's totally standard.
But hold that thought because what if I
told you that this response came from
the exact same Llama 3.1 model? No
fine-tuning, no special system prompt,
nothing. The model suddenly genuinely
seems to believe it's a large metal
structure. How is that even possible?
Yeah, that's the big question, right?
And it's exactly what we're going to
answer. We are going to unpack how you
can fundamentally change a model's
behavior at inference time. I mean,
while it's literally in the middle of
generating a response without touching a
single one of its trained weights. Okay,
so here's our game plan for figuring
this whole thing out. First, we're going
to borrow a really cool idea from
neuroscience to make this all feel more
intuitive. Then, we'll peek inside the
LLM's brain to see how it thinks about
concepts. After that, we'll get our
hands dirty and see steering in
practice. Then we'll figure out where you even find
these steering vectors. And finally,
we'll get real about the incredible
power and the limits of this technique.
All right, let's dive in. And to really
get a grip on this, we're actually going
to start somewhere you might not expect,
neuroscience. See, the whole idea of
model steering is kind of like a real
technique called neurostimulation.
So, neuroscientists can actually use
tiny electrodes or even magnetic fields
to, well, nudge specific parts of the
brain. They can trigger a movement,
bring up an emotion, or even spark a
memory. It's used all the time in
research to figure out how the brain
works, and even in medicine to treat
things like Parkinson's disease. And
here is the key parallel for us. Neurostimulation is something that happens on
the fly. It doesn't permanently change
the brain's wiring. And that is a
perfect analogy for what we're about to
do to an LLM. We're going to intervene
in its thinking process without
permanently changing its structure at
all. So with that brain analogy fresh in
our minds, let's make the jump over to
the AI itself. How do these artificial
neural networks actually represent
abstract ideas? And more importantly,
how can we mess with those
representations?
You know, inside a transformer model like Llama, information gets processed through this stack of layers. And as the data moves from one layer to the next, it's represented by this huge list of numbers, what we call a vector. This vector can have thousands of dimensions, and it's basically the model's internal state, its hidden thought at that precise moment.
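To make that concrete, here's a minimal sketch of peeking at one of those hidden thoughts with Hugging Face Transformers. The model name is just an example (it's gated on the Hub; any causal LM works the same way):

```python
# A minimal sketch: look at one layer's "hidden thought" for a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"  # example only; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One hidden state per layer, plus the input embeddings at index 0.
print(len(out.hidden_states))       # 33 for a 32-layer model
print(out.hidden_states[15].shape)  # (1, seq_len, 4096): a 4096-number "thought"
```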
Okay, now this is where the magic happens. Researchers found this incredible thing they call the linear representation phenomenon. It turns out that LLMs naturally, on their own, learn to represent concepts we can understand, you know, like love or royalty or even the Eiffel Tower, as specific directions or vectors inside
that massive hidden space. You've
probably seen this classic example
before, right? It showed that you could
literally do math with word meanings.
You take the vector for king, subtract
man, add woman, and you end up right at
the vector for queen. It's wild, and it
shows that concepts aren't just random
points. They exist in a logical, structured way. The exact same principle
is at play inside the deeper layers of a
modern LLM. So, let's boil this down.
The direction of the vector is the concept. The length, or magnitude, of that vector is the concept's intensity. And
since they're just vectors, we can add
and subtract them. Now, it's also really
important to know that the middle layers
of the model are usually the sweet spot
for this stuff. The early layers are
just dealing with basic grammar, and the
final layers are just trying to format
the output. The middle, that's where the
abstract thinking is really happening.
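If you want to see that king-minus-man-plus-woman arithmetic for yourself, here's a minimal sketch using gensim's downloadable pretrained GloVe vectors. The vector set name is one of gensim's standard downloads, not something from the video:

```python
# A minimal sketch of the classic word-vector arithmetic, using gensim's
# pretrained GloVe embeddings (downloaded automatically on first run).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= ?
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # [('queen', ...)] -- "queen" comes out on top
```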
Okay, theory time is over. That's all
super interesting. But how do we
actually do this? How do we turn this
idea of concept vectors into our own
little neurostimulator for an AI? Let's
see what it looks like in practice.
You're going to love this. The actual operation is shockingly simple. As the model is thinking, you take its current activation vector and you just add your concept vector to it, scaled by a coefficient. That's it. The coefficient is just a number you pick to control how strong the effect is. Think of it like a volume dial for the concept of, say, the Eiffel Tower. And putting this into practice with a library like Hugging Face Transformers is literally just a few steps. You load your model, you load your vector, and then you create a little function called a hook. You attach that hook to a specific layer, like layer 15, and its only job is to add your steering vector to the activations every single time they pass through. Then you just generate text like normal. A rough sketch of what that might look like follows below.
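Here's a minimal sketch, not the video's exact code. The model name, the vector file, and layer 15 are assumptions. Underneath, the update is just h ← h + α·v, where h is the layer's hidden state, v is your concept vector, and α is the coefficient:

```python
# A minimal sketch of activation steering with a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"  # example only; gated on the Hub
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

steering_vector = torch.load("eiffel_tower_vector.pt")  # hypothetical file, shape (hidden_size,)
coefficient = 8.0  # the "volume dial" for the concept

def steering_hook(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden
    # states; add the scaled steering vector at every token position.
    if isinstance(output, tuple):
        hidden = output[0]
        steered = hidden + coefficient * steering_vector.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:]
    return output + coefficient * steering_vector.to(output.device, output.dtype)

# Attach the hook to a middle layer, where the abstract thinking happens.
handle = model.model.layers[15].register_forward_hook(steering_hook)

inputs = tokenizer("Who are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

handle.remove()  # remove the hook and the model is back to normal
```

Notice that last line: removing the hook undoes everything, which is exactly the neurostimulation parallel; nothing about the model's weights ever changed.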
So, let's see what this actually looks like. We ask the base Llama model for business ideas, and we get normal stuff: e-commerce, services. Now we apply our Eiffel Tower
vector, but with a pretty low
coefficient, just a 4.0. And look, the
model's ideas subtly shift to food and
bakeries. It's not screaming Paris, but
you can feel that the perspective has
been nudged. And now the punch line. We
crank that dial way up to a coefficient
of 8.0. And we ask, "Who are you again?"
And there it is. The model completely
takes on the new persona. What's so
amazing is that the original response
started with I'm a large language model,
but the steered one starts with I'm a
large metal structure. The steering
literally changed the model's mind mid-thought. Okay, this is just incredible,
right? But it leads to a huge question.
Where in the world do you get these
magical steering vectors from? I mean,
how did anyone find the exact direction
for Eiffel Tower inside the model's
hidden space? Well, there are basically
two main ways to do it. The first is
called contrastive activation. That's
where you'd show the model a bunch of
text that has the concept you want and a
bunch of text that doesn't. You then
find the difference in the model's
internal activations and boom, that's
your vector. The other newer method uses
something called sparse autoenccoders.
You can think of this as a special tool
that can sift through all the models
messy internal thoughts and
automatically pull out a whole library
of individual concepts. And here's the
best part. You don't have to do all this
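Here's a minimal sketch of that contrastive idea: average the model's hidden states over concept-laden text and over neutral text, then subtract. The example texts, layer index, and file name are assumptions (the file name matches the hypothetical one loaded in the earlier steering sketch):

```python
# A minimal sketch of contrastive activation extraction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"  # example only
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

positive_texts = [
    "The Eiffel Tower soars over Paris in wrought iron.",
    "Tourists gather beneath the Eiffel Tower at night.",
]
negative_texts = [
    "The committee approved the quarterly budget today.",
    "She watered the plants before leaving for work.",
]

LAYER = 15  # a middle layer, where abstract concepts tend to live

def mean_activation(texts):
    """Average the chosen layer's hidden state over all tokens and texts."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        hidden = out.hidden_states[LAYER]  # (1, seq_len, hidden_size)
        vecs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# The difference of means is the steering vector.
steering_vector = mean_activation(positive_texts) - mean_activation(negative_texts)
torch.save(steering_vector, "eiffel_tower_vector.pt")  # hypothetical file name
```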
And here's the best part: you don't have to do all this heavy lifting yourself. There are tools like Neuronpedia, where you can literally just browse through concepts that people have already discovered and visualized. And over on the Hugging Face Hub, the community is sharing pre-trained sparse autoencoders and steering vectors that you can just download and start playing with right away. So, it's
pretty clear this is a seriously
powerful technique. But like any tool,
it's not magic. It's really important to
understand what steering is great at and
what it, well, can't do. On the plus
side, the benefits are huge. You don't
need to do any expensive fine-tuning. It
works instantly. You can dial the
intensity up or down with just one
number and the effect holds up really
well. But then you have the cons.
Finding that perfect coefficient can be
tricky. If you turn it up too high, the
model can just start spouting nonsense.
And here's the big one. Steering cannot
teach the model new information. If the
model has never learned about a concept,
you can't just invent a vector for it.
You can only amplify what's already in
there. And all of this leaves us with a really big, kind of mind-bending question for the future. We're moving past just talking to AIs with prompts. We are now
building the tools to perform a kind of
micro surgery on their internal ideas.
As this tech gets better and better,
what is that going to mean for how we
control and align the truly powerful AI
systems that are coming?