Reproducible benchmarks, methodology, and leaderboards for conversational AI and AI operations.
1 publication
We tested 10 LLMs on 30 multi-turn service-inquiry dialogues. gpt-5.4 was the only model to hit 100% on implicit inference, 25 points ahead of an 8-way tie.