Neomanex
Conversational AI · Benchmarks · Gnosari Dataset v1.1

Conversational Extraction: Which LLMs actually capture structured data from real dialogue?

Leaderboard

11 LLMs from 3 providers on 44 multi-turn dialogues. gemini-2.5-pro enters at #3 (97.9%), tied with gpt-4o. Long-context stays the biggest discriminator (16.7 pp gap, n=6).
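
As a rough aid to reading the "16.7 pp gap, n=6" figure: if each long-context dialogue is scored pass/fail and weighted equally (an assumption, since the scoring rule is not stated here), then one dialogue out of six is worth about 16.7 percentage points. The helper name `step_pp` is hypothetical and used only for illustration.

```python
def step_pp(n: int) -> float:
    """Score change, in percentage points, when one of n equally weighted items flips."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 100.0 / n


if __name__ == "__main__":
    # Assumed: 6 long-context dialogues, 44 dialogues overall (both counts from the summary).
    print(round(step_pp(6), 1))   # 16.7
    print(round(step_pp(44), 1))  # 2.3
```

Under that assumption, a 16.7 pp gap on six items is the same size as a single dialogue, so small-sample noise is worth keeping in mind when comparing models on this category.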

Sorted by Overall, descending. 11 rows.

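| Rank | Model | Provider | Overall | (unlabeled) | (unlabeled) | (unlabeled) | (unlabeled) | (unlabeled) |
|---|---|---|---|---|---|---|---|---|
| #1 | gpt-5.4 | OpenAI | 98.9 | 100 | 100 | 97.7 | 97.7 | 5.3 |
| #2 | gpt-4o | OpenAI | 98.3 | 100 | 100 | 100 | 93.2 | 6.4 |
| #3 | gemini-2.5-pro | Google | 97.9 | 100 | 99.2 | 98.5 | 93.9 | 6.3 |
| #4 | gpt-5-reasoning | OpenAI | 97.2 | 100 | 100 | 99.2 | 89.4 | 7.2 |
| #5 | o4-mini | OpenAI | 97 | 100 | 97.7 | 99.2 | 90.9 | 8.1 |
| #6 | gpt-5.4-mini | OpenAI | 96.8 | 100 | 97 | 100 | 90.2 | 8.1 |
| #7 | gpt-4o-mini | OpenAI | 95.1 | 100 | 97.7 | 96.2 | 86.4 | 9.9 |
| #8 | llama-3.1-8b | Meta (OSS) | 94.3 | 100 | 97.7 | 97.7 | 81.8 | 10.6 |
| #9 | llama-3.2-3b | Meta (OSS) | 93.2 | 97.7 | 97.7 | 95.5 | 81.8 | 11.3 |
| #10 | gpt-5.4-nano | OpenAI | 93 | 100 | 95.5 | 97.7 | 78.8 | 12.3 |
| #11 | mistral-7b | Mistral (OSS) | 92.6 | 100 | 95.5 | 97.7 | 77.3 | 12.7 |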
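
The score columns' labels are not shown here, but a quick consistency check is possible. The sketch below assumes the first numeric column is Overall (per the sort note) and that the next four are per-category scores; under that assumption, Overall appears to match an unweighted mean of those four to within rounding. This is an observation from the published numbers, not a documented scoring rule, and the names `ROWS`, `max_deviation`, and `is_sorted_descending` are hypothetical.

```python
# (model, overall, four assumed per-category scores), copied from the leaderboard rows.
ROWS = [
    ("gpt-5.4",         98.9, (100, 100, 97.7, 97.7)),
    ("gpt-4o",          98.3, (100, 100, 100, 93.2)),
    ("gemini-2.5-pro",  97.9, (100, 99.2, 98.5, 93.9)),
    ("gpt-5-reasoning", 97.2, (100, 100, 99.2, 89.4)),
    ("o4-mini",         97.0, (100, 97.7, 99.2, 90.9)),
    ("gpt-5.4-mini",    96.8, (100, 97, 100, 90.2)),
    ("gpt-4o-mini",     95.1, (100, 97.7, 96.2, 86.4)),
    ("llama-3.1-8b",    94.3, (100, 97.7, 97.7, 81.8)),
    ("llama-3.2-3b",    93.2, (97.7, 97.7, 95.5, 81.8)),
    ("gpt-5.4-nano",    93.0, (100, 95.5, 97.7, 78.8)),
    ("mistral-7b",      92.6, (100, 95.5, 97.7, 77.3)),
]


def max_deviation(rows):
    """Largest absolute gap between the listed Overall and the mean of the four scores."""
    return max(abs(sum(parts) / len(parts) - overall) for _, overall, parts in rows)


def is_sorted_descending(rows):
    """True if the Overall column never increases from one row to the next."""
    overalls = [overall for _, overall, _ in rows]
    return all(a >= b for a, b in zip(overalls, overalls[1:]))


if __name__ == "__main__":
    # A deviation well under 0.1 is what one-decimal rounding alone would produce.
    print(f"max deviation from unweighted mean: {max_deviation(ROWS):.3f}")
    print(f"sorted by Overall, descending: {is_sorted_descending(ROWS)}")
```

The final column (5.3 through 12.7) is left out of the check because nothing here indicates what it measures.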