My group & collaborators have developed many popular benchmarks over the years (e.g., MMLU, MATH, APPS). Really excited about our latest benchmark, OMEGA Ω:

🔍 Can LLMs really think outside the box in math?

OMEGA is a new benchmark probing 3 axes of generalization:
1️⃣ Exploratory
2️⃣ Compositional
3️⃣ Transformative

It exposes the limitations of today's frontier AI and RL training along these dimensions of generalization.

Inspired by Boden's typology of creativity, OMEGA advances beyond prior benchmarks with a programmatically generated dataset that combines precise control with rich diversity. Spanning a wide range of mathematical domains, it is explicitly designed to evaluate distinct axes of generalization and creative reasoning. By isolating and quantifying fine-grained failure modes, OMEGA provides a foundation for advancing LLMs toward genuine mathematical creativity, beyond mechanical proficiency.

Huge thanks to my postdoc @YiyouSun @UCBerkeley for leading the project, and to amazing collaborators @nouhadziri @HannaHajishirzi @allen_ai and other co-authors!
Nouha Dziri · June 25, 2025
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet still failed at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found that, although very powerful, RL-trained models struggle to compose skills and to invent strategies they never saw during training.

👇 Work with @UCBerkeley @allen_ai. A thread on what we learned 🧵