Just read the AdvPrompter paper. Just to confirm a couple of things:
- AdvPrompterOpt describes a method of decoding from a probability distribution (e.g. an LM) while optimising for some external metric
- We're going to use this to generate highly-activating examples for specific SAE features
- You suggested using features that activate on "star wars" related concepts
8:48
Daniel Tan
preliminary thoughts:
- The underlying method seems remarkably simple! It samples likely tokens from the probability distribution, then decodes (greedily or via beam search) to find adversarial examples. The simplicity is really good for us; I'm just surprised I haven't heard of it before. Seems related to importance-weighted sampling
- I think we could actually get a really nice gradio or streamlit demo out of this if it works well!
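If I'm reading it right, the core loop looks roughly like this (a minimal sketch, not the paper's implementation — the toy next-token distribution, vocab, and `external_score` are all made up; in our case `external_score` would be the SAE feature activation):

```python
def next_token_probs(prefix):
    # Toy stand-in for a real LM's next-token distribution.
    return {"a": 0.4, "b": 0.25, "c": 0.25, "<eos>": 0.1}

def external_score(seq):
    # Stand-in for the external metric we optimize
    # (for us: how strongly a chosen SAE feature activates).
    return seq.count("b")

def decode(max_len=5, top_k=3):
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(seq)
        # 1) keep only tokens the LM thinks are likely...
        candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]
        # 2) ...then greedily pick whichever maximizes the external metric
        #    (the paper also allows beam search here instead of greedy).
        best = max(candidates, key=lambda t: external_score(seq + [t]))
        if best == "<eos>":
            break
        seq.append(best)
    return seq
```

With this toy setup the decoder just picks "b" every step, since that's the only token the metric rewards — but it never considers tokens outside the LM's top-k, which is the whole trick for keeping outputs fluent.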
8:49
Daniel Tan
I'm really curious to see what the generated examples look like for SAE features
Implementation
We encountered a fundamental difference between our setting and the AdvPrompter setting:
- In AdvPrompter, they are decoding from a model trained to predict sequences, so the next-token probabilities for target sequences will be high even at intermediate token positions
- In our case, it is not a priori true that the model assigns high probability to the intermediate tokens of a high-scoring sequence
Using EPO