
FML-Bench: Evaluating AI Agents on Real-World Machine Learning Research Codebases Beyond Kaggle
Why FML-Bench Matters
When we first meet FML-bench, it looks like another benchmark, but it is trying to answer a harder question: can an AI agent work inside a real machine learning research codebase and make a meaningful scientific improvement? Why does FML-bench matter if we already have plenty of







