This study examines the impact of fine-tuning on language model interpretability using sparse autoencoders (SAEs) and causal interventions. We hypothesized that fine-tuned models would exhibit sharper, more task-relevant features than baseline models. However, our results revealed that the baseline GPT-2 model outperformed the fine-tuned version in both feature interpretability and responsiveness to causal interventions. The baseline model showed higher average coherence scores (0.5438 vs. 0.3511) and greater receptiveness to vector steering (99% vs. 83.7%), suggesting that fine-tuning may produce more specialized but less coherent features.
@article{gupta2024unveiling,
  title  = {Unveiling the Black Box: Causal Inference and Feature Analysis in Fine-Tuned Language Models Using Sparse Autoencoders},
  author = {Gupta, Rini and Sica, Sean},
  year   = {2024},
  month  = aug,
}