Gnosari Dataset v1.1

Conversational Extraction: Which LLMs actually capture structured data from real dialogue? Leaderboard

Name: Conversational Extraction: Which LLMs actually capture structured data from real dialogue? — Leaderboard Dataset
Creator: Neomanex
Published: 2026-04-19
License: https://opensource.org/licenses/MIT

11 LLMs from 3 providers on 44 multi-turn dialogues. gemini-2.5-pro enters at #3 (97.9%), tied with gpt-4o. Long-context stays the biggest discriminator (16.7 pp gap, n=6).

Rank	Model
#1	gpt-5.4 (OpenAI)	98.9	100	100	97.7	97.7	5.3
#2	gpt-4o (OpenAI)	98.3	100	100	100	93.2	6.4
#3	gemini-2.5-pro (Google)	97.9	100	99.2	98.5	93.9	6.3
#4	gpt-5-reasoning (OpenAI)	97.2	100	100	99.2	89.4	7.2
#5	o4-mini (OpenAI)	97	100	97.7	99.2	90.9	8.1
#6	gpt-5.4-mini (OpenAI)	96.8	100	97	100	90.2	8.1
#7	gpt-4o-mini (OpenAI)	95.1	100	97.7	96.2	86.4	9.9
#8	llama-3.1-8b (Meta (OSS))	94.3	100	97.7	97.7	81.8	10.6
#9	llama-3.2-3b (Meta (OSS))	93.2	97.7	97.7	95.5	81.8	11.3
#10	gpt-5.4-nano (OpenAI)	93	100	95.5	97.7	78.8	12.3
#11	mistral-7b (Mistral (OSS))	92.6	100	95.5	97.7	77.3	12.7
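
To see which of the numeric columns separates the models most, one option is to compute the spread (best value minus worst value) per column. The sketch below does this for the rows above. The source does not label the six numeric columns, so they are referred to by position only. The names `LEADERBOARD`, `column_spreads`, and `is_sorted_by_first_column` are hypothetical helpers introduced for illustration and are not part of the dataset. The 16.7 pp gap quoted in the summary is not recomputed here, because the source does not say how it is defined.

```python
# Rows copied from the leaderboard above: (model, provider, numeric columns).
# Assumption: the six numeric columns are unlabeled in the source, so they
# are handled by position (index 0..5) rather than by name.
LEADERBOARD = [
    ("gpt-5.4", "OpenAI", [98.9, 100, 100, 97.7, 97.7, 5.3]),
    ("gpt-4o", "OpenAI", [98.3, 100, 100, 100, 93.2, 6.4]),
    ("gemini-2.5-pro", "Google", [97.9, 100, 99.2, 98.5, 93.9, 6.3]),
    ("gpt-5-reasoning", "OpenAI", [97.2, 100, 100, 99.2, 89.4, 7.2]),
    ("o4-mini", "OpenAI", [97, 100, 97.7, 99.2, 90.9, 8.1]),
    ("gpt-5.4-mini", "OpenAI", [96.8, 100, 97, 100, 90.2, 8.1]),
    ("gpt-4o-mini", "OpenAI", [95.1, 100, 97.7, 96.2, 86.4, 9.9]),
    ("llama-3.1-8b", "Meta (OSS)", [94.3, 100, 97.7, 97.7, 81.8, 10.6]),
    ("llama-3.2-3b", "Meta (OSS)", [93.2, 97.7, 97.7, 95.5, 81.8, 11.3]),
    ("gpt-5.4-nano", "OpenAI", [93, 100, 95.5, 97.7, 78.8, 12.3]),
    ("mistral-7b", "Mistral (OSS)", [92.6, 100, 95.5, 97.7, 77.3, 12.7]),
]


def column_spreads(rows):
    """Return max minus min for each numeric column, rounded to 0.1."""
    columns = zip(*(values for _, _, values in rows))
    return [round(max(col) - min(col), 1) for col in columns]


def is_sorted_by_first_column(rows):
    """Check that the listed order is non-increasing in the first column."""
    firsts = [values[0] for _, _, values in rows]
    return all(a >= b for a, b in zip(firsts, firsts[1:]))


if __name__ == "__main__":
    spreads = column_spreads(LEADERBOARD)
    widest = max(range(len(spreads)), key=spreads.__getitem__)
    print("spread per column:", spreads)
    print("widest column (0-based index):", widest)
    print("ranked by first column:", is_sorted_by_first_column(LEADERBOARD))
```

On these rows the fifth numeric column has by far the widest spread, while the second is nearly saturated at 100 for every model. The listed ranking is consistent with sorting on the first numeric column.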