Presentation

Mind Your Manners: Detoxifying Language Models via Attention Head Intervention
Description

Transformer-based large language models have advanced natural language processing with their ability to generate fluent text. However, these models exhibit and amplify toxicity and bias learned from their training data, posing new ethical challenges. We build upon the Attention Lens framework to enable scalable decoding of the information carried by the attention mechanism. We then use this decoded information to implement a pipeline that generates and removes toxic memories from pre-trained language models in a way that is human interpretable and effective while retaining model performance.
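The sketch below is a minimal, illustrative rendering of the two ideas the description mentions, not the authors' implementation: a per-head "lens" that decodes an attention head's output into vocabulary logits so its contribution can be read in human-interpretable tokens, and an intervention that ablates a head flagged as writing toxic content. All class names, shapes, and the toy attention layer are assumptions made for this example.

```python
# Illustrative sketch only: a toy multi-head attention layer, a per-head lens
# that projects head outputs into vocabulary logits, and a head-ablation mask.
# Names and architecture details are assumptions, not the paper's code.

import torch
import torch.nn as nn


class ToyMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Per-head mask used for the intervention: 1 keeps a head, 0 ablates it.
        self.register_buffer("head_mask", torch.ones(n_heads))

    def forward(self, x: torch.Tensor):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        shape = (b, t, self.n_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        per_head = attn @ v                          # (b, heads, t, d_head)
        # Intervention: silence ablated heads before their outputs are mixed.
        per_head = per_head * self.head_mask.view(1, -1, 1, 1)
        mixed = per_head.transpose(1, 2).reshape(b, t, -1)
        return self.out(mixed), per_head


class HeadLens(nn.Module):
    """Decodes a single head's output directly into vocabulary logits."""

    def __init__(self, d_head: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_head, vocab_size)

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        return self.proj(head_out)                   # (b, t, vocab)


if __name__ == "__main__":
    d_model, n_heads, vocab = 64, 4, 1000
    attn = ToyMultiHeadAttention(d_model, n_heads)
    lenses = nn.ModuleList(HeadLens(d_model // n_heads, vocab) for _ in range(n_heads))

    x = torch.randn(2, 8, d_model)
    _, per_head = attn(x)

    # Inspect what head 2 is "writing": its top decoded tokens at the last position.
    logits = lenses[2](per_head[:, 2])               # (b, t, vocab)
    print(logits[0, -1].topk(5).indices)

    # If those tokens look toxic, ablate the head and rerun the model.
    attn.head_mask[2] = 0.0
```

In practice the lenses would be trained against a frozen pre-trained model and the intervention applied only to heads whose decoded vocabulary consistently surfaces toxic tokens, so that overall model performance is retained.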