Interpretability

1 article

Feature Recovery, Feature-Learning, Fourier-Features, Grokking, In-Context Learning, Interpretability, LLM, Mechanistic-Interpretability, Modular-Addition, Sparse Autoencoders, Transformers

Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
