Just read the AdvPrompter paper. Just to confirm a couple of things:
- AdvPrompterOpt describes a method of decoding from a probability distribution (e.g. an LM) while optimising for some external metric
- We're going to use this to generate highly-activating examples for specific SAE features
- You suggested using features that activate on "star wars" related concepts
8:48
Daniel Tan
preliminary thoughts:
- The underlying method seems remarkably simple! It samples likely tokens from the probability distribution, then decodes (greedily or via beam search) to find adversarial examples. The simplicity is really good for us; I'm just surprised I haven't heard of it before. Seems related to importance-weighted sampling
- I think we could actually get a really nice gradio or streamlit demo out of this if it works well!
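If I'm reading it right, the core loop looks roughly like this (a minimal sketch, not the paper's implementation — the toy next-token distribution, vocab, and `external_score` are all made up; in our case `external_score` would be the SAE feature activation):

```python
def next_token_probs(prefix):
    # Toy stand-in for a real LM's next-token distribution.
    return {"a": 0.4, "b": 0.25, "c": 0.25, "<eos>": 0.1}

def external_score(seq):
    # Stand-in for the external metric we optimize
    # (for us: how strongly a chosen SAE feature activates).
    return seq.count("b")

def decode(max_len=5, top_k=3):
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(seq)
        # 1) keep only tokens the LM thinks are likely...
        candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]
        # 2) ...then greedily pick whichever maximizes the external metric
        #    (the paper also allows beam search here instead of greedy).
        best = max(candidates, key=lambda t: external_score(seq + [t]))
        if best == "<eos>":
            break
        seq.append(best)
    return seq
```

With this toy setup the decoder just picks "b" every step, since that's the only token the metric rewards — but it never considers tokens outside the LM's top-k, which is the whole trick for keeping outputs fluent.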
8:49
Daniel Tan
I'm really curious to see what the generated examples look like for SAE features
Implementation
We encountered a fundamental difference between our setting and the AdvPrompter setting:
- In AdvPrompter, they are decoding from a model trained to predict sequences, so the next-token probabilities for target sequences will be high even at intermediate token positions
- In our case, it is not a priori true that the model assigns high probability to the intermediate tokens of a high-scoring sequence
Using EPO