Reproducible benchmarks, methodology, and leaderboards for conversational AI and AI operations.
1 publication
11 LLMs from 3 providers evaluated on 44 multi-turn dialogues. gemini-2.5-pro enters at #3 (97.9%), tied with gpt-4o. Long-context dialogues remain the biggest discriminator between models (16.7 percentage-point gap, n=6).