This study examines the impact of fine-tuning on language model interpretability using sparse autoencoders (SAEs) and causal interventions. We hypothesized that fine-tuned models would exhibit sharper, more task-relevant features than baseline models. However, our results revealed that the baseline GPT-2 model outperformed the fine-tuned version in both feature interpretability and responsiveness to causal interventions. The baseline model showed higher average coherence scores (0.5438 vs. 0.3511) and greater receptiveness to vector steering (99% vs. 83.7%), suggesting that fine-tuning may produce more specialized but less coherent features.
@article{gupta2024unveiling,
  title  = {Unveiling the Black Box: Causal Inference and Feature Analysis in Fine-Tuned Language Models Using Sparse Autoencoders},
  author = {Gupta, Rini and Sica, Sean},
  year   = {2024},
  month  = aug,
}