Total relevant papers: 2
Paper selection prompt and criteria at the bottom
Table of contents with paper titles:
A safety realignment framework via subspace-oriented model fusion for large language models Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models Authors: Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap
Title: A safety realignment framework via subspace-oriented model fusion for large language models
ArXiv ID: 2405.09055 Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
Abstract: The current safeguard mechanisms for large language models (LLMs) are susceptible to jailbreak attacks, making them inherently fragile. Even fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there is a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors via subspace masking techniques. Finally, we fuse the initially safety-aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
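To make the fusion step in the abstract concrete, here is a minimal illustrative sketch of the general task-vector-with-mask idea it describes. This is an assumption-based simplification, not the paper's actual SOMF algorithm: the function name `somf_realign`, the treatment of the safety mask as a given binary array, and the simple averaging of masked task vectors are all hypothetical choices for illustration.

```python
import numpy as np

def somf_realign(theta_aligned, fine_tuned_weights, safety_mask):
    """Illustrative sketch (not the paper's exact method).

    theta_aligned      : weights of the initially safety-aligned model
    fine_tuned_weights : list of weight arrays, one per fine-tuned model
    safety_mask        : binary array, 1 where a coordinate lies in the
                         identified safety subspace, 0 elsewhere
    """
    # Task vector = weight delta of each fine-tuned model vs. the aligned model.
    task_vectors = [theta - theta_aligned for theta in fine_tuned_weights]
    # Zero out the safety subspace in each task vector so the aligned
    # model's safety-related weights are preserved during fusion.
    masked = [(1.0 - safety_mask) * tv for tv in task_vectors]
    # Fuse: aligned weights plus the average of the masked task vectors.
    return theta_aligned + sum(masked) / len(masked)
```

In this toy form, coordinates inside the mask keep the aligned model's values exactly, while the remaining coordinates receive the averaged downstream updates; the paper's contribution lies in how the safety subspace is identified, which this sketch takes as given.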
Comment: This paper is relevant to criterion 1, as it discusses a safety realignment framework for LLMs that covers instruction following, a specific fine-tuning step for making language models better at following instructions. It also touches on safety properties of language models, which relates to criterion 5, though it is not a direct match. Relevance: 5 Novelty: 6
Title: PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
ArXiv ID: 2405.09373 Authors: Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap
Abstract: Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks when deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we study the impact of model size, prompt language, and instruction- and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have a significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.
Comment: This paper does not closely match any of the specified criteria. It introduces a multilingual toxicity evaluation benchmark, which relates to safety properties of language models but does not fit the criteria for methodological improvements to RLHF or instruction following, test-set contamination, diffusion language models, evaluation paradigms, or scaling laws. Relevance: 3 Novelty: 4
In suggesting papers to your friend, remember that he enjoys papers on statistical machine learning and generative modeling in natural language processing. Your friend also likes learning about surprising empirical results in language models, as well as clever statistical tricks. He does not want to read papers that are primarily about applications of methods to specific domains.