Jurgens: Meta’s new AI model, Llama 4, still fails real-world tests

Quoted by The Hill. Associate professor David Jurgens. AI has evolved. It's time for better evaluations and report cards.

Wednesday, 05/21/2025

By Noor Hindi

Meta’s recently launched Llama 4 model has been praised for its “unrivaled speed and efficiency,” but experts caution that those metrics don’t tell the whole story.

In an op-ed for The Hill, University of Michigan School of Information associate professor David Jurgens, an expert on artificial intelligence and computational social science, says that across the AI industry, models routinely pass standard benchmarks and rank highly on leaderboards, yet falter when deployed in real-world situations.

“Other leading AI models have lied about real people, advised businesses to break the law and excluded certain groups of people from getting jobs,” Jurgens says. “It’s a sign that our methods for evaluating the effectiveness of AI don’t translate to real-world applications and outcomes.”

Instead, Jurgens argues for “report cards that evaluate AI more holistically.”

“AI developers need to learn what makes their products effective from the perspective of their stakeholders and customers,” Jurgens says. “With that knowledge, companies and benchmark developers can ensure their data can adequately evaluate AI.”

Learn more about UMSI associate professor David Jurgens by visiting his UMSI faculty profile.