Conversational AI · This research · benchmarks · Gnosari Dataset v1.1
Conversational Extraction Leaderboard: Which LLMs actually capture structured data from real dialogue?
11 LLMs from 3 providers on 44 multi-turn dialogues. gemini-2.5-pro enters at #3 (97.9%), tied with gpt-4o. Long-context stays the biggest discriminator (16.7 pp gap, n=6).
Sorted by Overall, descending. 11 rows.
| Rank | Model | Overall | | | | | | |
|---|---|---|---|---|---|---|---|---|
| #1 | gpt-5.4 (OpenAI) | 98.9 | 100 | 100 | 97.7 | 97.7 | 5.3 | |
| #2 | gpt-4o (OpenAI) | 98.3 | 100 | 100 | 100 | 93.2 | 6.4 | |
| #3 | gemini-2.5-pro (Google) | 97.9 | 100 | 99.2 | 98.5 | 93.9 | 6.3 | |
| #4 | gpt-5-reasoning (OpenAI) | 97.2 | 100 | 100 | 99.2 | 89.4 | 7.2 | |
| #5 | o4-mini (OpenAI) | 97 | 100 | 97.7 | 99.2 | 90.9 | 8.1 | |
| #6 | gpt-5.4-mini (OpenAI) | 96.8 | 100 | 97 | 100 | 90.2 | 8.1 | |
| #7 | gpt-4o-mini (OpenAI) | 95.1 | 100 | 97.7 | 96.2 | 86.4 | 9.9 | |
| #8 | llama-3.1-8b (Meta, OSS) | 94.3 | 100 | 97.7 | 97.7 | 81.8 | 10.6 | |
| #9 | llama-3.2-3b (Meta, OSS) | 93.2 | 97.7 | 97.7 | 95.5 | 81.8 | 11.3 | |
| #10 | gpt-5.4-nano (OpenAI) | 93 | 100 | 95.5 | 97.7 | 78.8 | 12.3 | |
| #11 | mistral-7b (Mistral, OSS) | 92.6 | 100 | 95.5 | 97.7 | 77.3 | 12.7 | |
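The page does not say how Overall is computed. As a sanity check on the table, the sketch below tests one assumption: that Overall is the unweighted mean of the four score columns that follow it. The rows are copied from the table; the names `ROWS`, `mean_score`, and `max_deviation` are illustrative, and the unlabeled final numeric column is left out because its meaning is not given.

```python
# Assumption being tested: Overall == unweighted mean of the next four score columns.
# Rows copied from the leaderboard: (model, listed overall, four score columns).
ROWS = [
    ("gpt-5.4", 98.9, (100, 100, 97.7, 97.7)),
    ("gpt-4o", 98.3, (100, 100, 100, 93.2)),
    ("gemini-2.5-pro", 97.9, (100, 99.2, 98.5, 93.9)),
    ("gpt-5-reasoning", 97.2, (100, 100, 99.2, 89.4)),
    ("o4-mini", 97.0, (100, 97.7, 99.2, 90.9)),
    ("gpt-5.4-mini", 96.8, (100, 97, 100, 90.2)),
    ("gpt-4o-mini", 95.1, (100, 97.7, 96.2, 86.4)),
    ("llama-3.1-8b", 94.3, (100, 97.7, 97.7, 81.8)),
    ("llama-3.2-3b", 93.2, (97.7, 97.7, 95.5, 81.8)),
    ("gpt-5.4-nano", 93.0, (100, 95.5, 97.7, 78.8)),
    ("mistral-7b", 92.6, (100, 95.5, 97.7, 77.3)),
]


def mean_score(scores):
    """Unweighted mean of the per-column scores."""
    return sum(scores) / len(scores)


def max_deviation(rows):
    """Largest absolute gap between the listed Overall and the recomputed mean."""
    return max(abs(mean_score(scores) - overall) for _, overall, scores in rows)


if __name__ == "__main__":
    for model, overall, scores in ROWS:
        print(f"{model:16s} listed={overall:5.1f}  mean={mean_score(scores):6.2f}")
    print(f"largest deviation: {max_deviation(ROWS):.3f}")
```

Every row agrees with the listed Overall to within rounding at one decimal place, which is consistent with a simple average but does not prove that this is the method the authors used.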

