My group & collaborators have developed many popular benchmarks over the years, e.g., MMLU, MATH, APPS. Really excited about our latest benchmark, OMEGA Ω:
🔍Can LLMs really think outside the box in math?
a new benchmark probing 3 axes of generalization:
1️⃣ Exploratory
2️⃣ Compositional
3️⃣ Transformative
showing limitations of today's frontier AI and RL-training in these dimensions of generalization.
Inspired by Boden’s typology of creativity, OMEGA advances beyond prior benchmarks with a programmatically generated dataset that combines precise control with rich diversity. Spanning a wide range of mathematical domains, it is explicitly designed to evaluate distinct axes of generalization and creative reasoning.
By isolating and quantifying fine-grained failure modes, OMEGA provides a foundation for advancing LLMs toward genuine mathematical creativity—beyond mechanical proficiency.
Huge thanks to my postdoc @YiyouSun @UCBerkeley for leading the project, and to amazing collaborators @nouhadziri @HannaHajishirzi @allen_ai and the other co-authors!

June 25, 2025
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?
Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet were still failing at simple arithmetic? 😬
We built a benchmark to find out → OMEGA Ω 📐
💥 We found that although very powerful, RL struggles to compose skills and to innovate new strategies that were not seen during training. 👇
Work with @UCBerkeley and @allen_ai
A thread on what we learned 🧵
