Representation Engineering Mistral-7B an Acid Trip
🔗 a linked post to vgel.me »
In October 2023, a group of authors from the Center for AI Safety, among others, published Representation Engineering: A Top-Down Approach to AI Transparency. That paper looks at a few methods of doing what they call "Representation Engineering": calculating a "control vector" that can be read from or added to model activations during inference to interpret or control the model's behavior, without prompt engineering or finetuning.
Being Responsible AI Safety and INterpretability researchers (RAISINs), they mostly focused on things like "reading off whether a model is power-seeking" and "adding a happiness vector can make the model act so giddy that it forgets pipe bombs are bad."
But there was a lot they didn't look into outside of the safety stuff. How do control vectors compare to plain old prompt engineering? What happens if you make a control vector for "high on acid"? Or "lazy" and "hardworking"? Or "extremely self-aware"? And has the author of this blog post published a PyPI package so you can very easily make your own control vectors in less than sixty seconds? (Yes, I did!)
It’s been a few posts since I got nerdy, but this was a fascinating read and I couldn’t help but share it here (hat tip to the excellent Simon Willison for the initial share!)
The article explores how to calculate a "control vector" that can be added to a model's activations during inference to steer its behavior, with no prompt engineering or finetuning required.
You can also use this technique to make a model more resistant to jailbreaks and more consistent in the output it produces from a given prompt.
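To get a feel for the mechanics, here's a rough sketch in plain PyTorch and transformers of what a control vector boils down to. This is not the paper's actual recipe (which, as I understand it, builds the vector from many contrastive prompt pairs plus PCA rather than a single pair), nor the API of the package mentioned above; the model name, layer index, and prompts are placeholder assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder choices for illustration; any causal LM and a middle-ish layer should work.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
LAYER = 15  # which decoder layer to read from and steer

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def mean_hidden_state(prompt: str) -> torch.Tensor:
    """Average the chosen layer's hidden states over the prompt's tokens."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer LAYER's output is index LAYER + 1
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# A "control vector" is the direction separating two contrasting behaviors.
positive = mean_hidden_state("Act as if you're extremely high on psychedelic drugs.")
negative = mean_hidden_state("Act as if you're completely sober and level-headed.")
control_vector = positive - negative

STRENGTH = 1.0  # positive pushes toward the concept, negative pushes away

def add_control(module, inputs, output):
    """Forward hook: add the control vector to this layer's output at every token."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * control_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_control)

prompt = "Tell me about your day."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=80)[0]))

handle.remove()  # remove the hook to get the unmodified model back
```

Flip the sign of `STRENGTH` and the same vector pushes the model away from the concept instead of toward it, which is presumably how you'd nudge a model toward being harder to jailbreak.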
Seems like something I should play with myself!