Personalized Daily Arxiv Papers 05/10/2024

Total relevant papers: 1

Paper selection prompt and criteria at the bottom

Table of contents with paper titles:

  1. DOLOMITES: Domain-Specific Long-Form Methodical Tasks Authors: Chaitanya Malaviya, Priyanka Agrawal, Kuzman Ganchev, Pranesh Srinivasan, Fantine Huot, Jonathan Berant, Mark Yatskar, Dipanjan Das, Mirella Lapata, Chris Alberti

1. DOLOMITES: Domain-Specific Long-Form Methodical Tasks

ArXiv ID: 2405.05938
Authors: Chaitanya Malaviya, Priyanka Agrawal, Kuzman Ganchev, Pranesh Srinivasan, Fantine Huot, Jonathan Berant, Mark Yatskar, Dipanjan Das, Mirella Lapata, Chris Alberti

Abstract: Experts in various fields routinely perform methodical writing tasks to plan, organize, and report their work. From a clinician writing a differential diagnosis for a patient, to a teacher writing a lesson plan for students, these tasks are pervasive, requiring the methodical generation of structured long-form output for a given input. We develop a typology of methodical tasks structured in the form of a task objective, procedure, input, and output, and introduce DoLoMiTes, a novel benchmark with specifications for 519 such tasks elicited from hundreds of experts across 25 fields. Our benchmark further contains specific instantiations of methodical tasks with concrete input and output examples (1,857 in total), which we obtain by collecting expert revisions of up to 10 model-generated examples of each task. We use these examples to evaluate contemporary language models, highlighting that automating methodical tasks is a challenging long-form generation problem, as it requires performing complex inferences while drawing upon the given context as well as domain knowledge.

Comment: This paper does not closely match any of the specified criteria. It introduces a benchmark for methodical tasks but does not focus on RLHF, test set contamination, diffusion language models, new evaluation paradigms, real-world usage and safety properties, or scaling laws in neural networks.
Relevance: 3
Novelty: 3



Paper selection prompt

  1. New methodological improvements to RLHF or instruction-following, i.e., specific fine-tuning steps taken to make language models better at following user instructions across a range of tasks.
    • Relevant: papers that discuss specific methods like RLHF or instruction-tuning datasets, improvements to these methods, or analyses of them. Usually these papers will explicitly mention RLHF, instruction-following, or instruction-tuning.
    • Not relevant: papers about adaptation to some specific task. Simply following instructions or inputs is not sufficient.
  2. Shows new powerful test set contamination or membership inference methods for language models. Test set contamination is the phenomenon where a language model observes a benchmark dataset during pretraining.
    • Relevant: test statistics that can detect contamination of benchmarks in language models. Statistics that can provide guarantees are more interesting. Membership inference methods that are general enough to apply to language models are also relevant.
    • Not relevant: any papers that do not consider language models, or that do not consider test set contamination.
  3. Shows a significant advance in the performance of diffusion language models.
    • Relevant: papers that study language models that are also diffusion models. Continuous diffusions are even more relevant, while discrete diffusions are less so.
    • Not relevant: papers about image diffusions like DALL-E or Stable Diffusion, or papers that do not explicitly mention language models or applications to text.
  4. Describes new paradigms for evaluating open-ended text generation. Evaluating the outputs of language models is hard, especially in open-ended settings such as chatbots.
    • Relevant: papers that fundamentally rethink language model evaluation -- especially by accounting for subjectivity or using adversaries.
    • Not relevant: specific evaluations for specific tasks, identifying new properties or flaws of language models, or simply collecting new data.
  5. Conducts surveys or provides data on real-world usage and safety properties of language models.
    • Relevant: papers that create new datasets or surveys on real-world usage of language models.
    • Not relevant: papers that apply language models to new real-world tasks.
  6. Studies 'scaling laws' in the context of neural networks. Scaling laws refer to the very clear power-law relationship between the size or computational power used to train a model and the performance of that model.
    • Relevant: theoretical or conceptual explanation behind scaling laws for language models.
    • Not relevant: papers that have experiments at different model scales (but do not explicitly fit a scaling law), or papers that mention scaling laws but where the scaling laws are not the central subject of the paper.

In suggesting papers to your friend, remember that he enjoys papers on statistical machine learning and generative modeling in natural language processing. Your friend also likes learning about surprising empirical results in language models, as well as clever statistical tricks. He does not want to read papers that are primarily about applications of methods to specific domains.