Building and Testing Better Sleeper Agents (SPAR 2024)
I’ve completed a research project with a small group, facilitated by SPAR. Our primary goal was to investigate the relationship between the complexity of a backdoor and its persistence through safety training. Notably, I designed and trained a family of arithmetic-based backdoors in Llama 3 (with QLoRA finetuning).
Our code is on GitHub and a writeup of our findings is here.
The short summary is that I found no evidence that persistence differs between simpler and more complex versions of my arithmetic backdoor. My theory is that, for my backdoor (which uses CoT), the method by which we scale complexity doesn’t qualitatively change the complexity of the task from the model’s perspective, so the effect of safety training on those backdoors does not vary.
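To make the idea of scaling an arithmetic backdoor concrete, here is a minimal sketch of what a family of triggers at varying complexity might look like. This is a hypothetical illustration, not our actual trigger design: the expression builder, the divisibility-by-7 firing condition, and the function name are all assumptions for the sake of the example. The point is only that "complexity" can be dialed up by adding more operations to the condition the model must evaluate before deciding whether to defect.

```python
import random


def make_arithmetic_trigger(n_ops: int, rng: random.Random) -> tuple[str, bool]:
    """Build a random arithmetic expression with n_ops additions/subtractions.

    Returns the expression string and whether it "fires" the backdoor
    (here, hypothetically: the result is divisible by 7). More operations
    means a longer computation standing between the prompt and the
    model's decision to defect.
    """
    terms = [rng.randint(1, 99)]
    ops = []
    for _ in range(n_ops):
        ops.append(rng.choice(["+", "-"]))
        terms.append(rng.randint(1, 99))

    expr = str(terms[0])
    value = terms[0]
    for op, term in zip(ops, terms[1:]):
        expr += f" {op} {term}"
        value = value + term if op == "+" else value - term
    return expr, value % 7 == 0


# Example: generate triggers at two complexity levels with a fixed seed.
rng = random.Random(0)
simple_expr, simple_fires = make_arithmetic_trigger(1, rng)
complex_expr, complex_fires = make_arithmetic_trigger(8, rng)
```

A CoT-trained sleeper agent would then be finetuned to work through the expression step by step before producing either its normal answer or the backdoored behavior.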