Total relevant papers: 2
Paper selection prompt and criteria at the bottom
Table of contents with paper titles:
A safety realignment framework via subspace-oriented model fusion for large language models Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models Authors: Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap
Title: A safety realignment framework via subspace-oriented model fusion for large language models
ArXiv ID: 2405.09055 Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
Abstract: The current safeguard mechanisms for large language models (LLMs) are susceptible to jailbreak attacks, making them inherently fragile. Even fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there is a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors via subspace masking techniques. Finally, we fuse the initially safety-aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
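To make the fusion step in the abstract concrete, here is a minimal illustrative sketch of the general task-vector-with-mask idea it describes. This is an assumption-based simplification, not the paper's actual SOMF algorithm: the function name `somf_realign`, the treatment of the safety mask as a given binary array, and the simple averaging of masked task vectors are all hypothetical choices for illustration.

```python
import numpy as np

def somf_realign(theta_aligned, fine_tuned_weights, safety_mask):
    """Illustrative sketch (not the paper's exact method).

    theta_aligned      : weights of the initially safety-aligned model
    fine_tuned_weights : list of weight arrays, one per fine-tuned model
    safety_mask        : binary array, 1 where a coordinate lies in the
                         identified safety subspace, 0 elsewhere
    """
    # Task vector = weight delta of each fine-tuned model vs. the aligned model.
    task_vectors = [theta - theta_aligned for theta in fine_tuned_weights]
    # Zero out the safety subspace in each task vector so the aligned
    # model's safety-related weights are preserved during fusion.
    masked = [(1.0 - safety_mask) * tv for tv in task_vectors]
    # Fuse: aligned weights plus the average of the masked task vectors.
    return theta_aligned + sum(masked) / len(masked)
```

In this toy form, coordinates inside the mask keep the aligned model's values exactly, while the remaining coordinates receive the averaged downstream updates; the paper's contribution lies in how the safety subspace is identified, which this sketch takes as given.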
Comment: This paper is relevant to criterion 1, as it discusses a safety realignment framework for LLMs that covers instruction following, a specific fine-tuning step for making language models better at following instructions. It also touches on safety properties of language models, which relates to criterion 5, though it is not a direct match. Relevance: 5 Novelty: 6
Title: PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
ArXiv ID: 2405.09373 Authors: Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap
Abstract: Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks when deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we study the impact of model size, prompt language, and instruction- and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have a significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.
Comment: This paper does not closely match any of the specified criteria. It introduces a multilingual toxicity evaluation benchmark, which relates to safety properties of language models but does not fit the criteria for methodological improvements to RLHF or instruction following, test-set contamination, diffusion language models, evaluation paradigms, or scaling laws. Relevance: 3 Novelty: 4
In suggesting papers to your friend, remember that he enjoys papers on statistical machine learning and generative modeling in natural language processing. Your friend also likes learning about surprising empirical results in language models, as well as clever statistical tricks. He does not want to read papers that are primarily about applications of methods to specific domains.