
Conversational Extraction: Which LLMs actually capture structured data from real dialogue?

11 LLMs from 3 providers on 44 multi-turn dialogues. gemini-2.5-pro enters at #3 (97.9%), effectively tied with gpt-4o (98.3%). Long-context stays the biggest discriminator (16.7 pp gap, n=6).

Publication v1.1 · Methodology v1.1 · Dataset v1.1

Published: April 19, 2026
Last updated: April 19, 2026
Authors: Neomanex Research Lab

Models tested: 11 (OpenAI, Google, Llama, Mistral)
Test cases: 44 (6 categories, 8–25 turns each)
Top overall accuracy: 98.9% (gpt-5.4; gemini-2.5-pro at #3 with 97.9%)
Runs: 1,452 (3 repeats · 0 parse failures)
Long-context gap: 16.7 pp (largest robust signal, n=6 cases)

TL;DR

Long-context conversations are where extraction models diverge: across 11 LLMs from 3 providers tested on 44 multi-turn dialogues, the gap on 18–25-turn cases is 16.7 pp (n=6) — the largest robust category-level signal. gpt-5.4 leads at 98.9%, gpt-4o at 98.3%, gemini-2.5-pro enters at #3 (97.9%). Reasoning-series models are strictly worse than chat on accuracy, latency, and cost. A single long-context case (the wedding planner) still does more to differentiate models than every intentional distractor combined.

The v1.0 headline of 25 pp gap on implicit inference (n=2 cases) does not replicate: v1.1 expanded implicit to 8 cases and the gap compressed to 6.3 pp. Statistical robustness matters more than dramatic single-case findings.

Hypothesis

Research question

Given a multi-turn conversation between a customer and a service business, how accurately do current LLMs extract the correct structured values?

Motivation

Structured extraction from conversations is the load-bearing primitive of every real customer-intake agent, CRM auto-filler, and ticket router. Production systems ship on unverified assumptions about which models handle dialogue well enough to trust without human review. A 2–3 percentage point accuracy gap compounds into retries, escalations, and bad CRM records at scale. This benchmark replaces those assumptions with a flat, exact-match-scoreable, 44-case leaderboard spanning 11 models from 3 API providers.

Prior art

LLMStructBench / StructEval / JSONSchemaBench — Document-based structured extraction benchmarks — not dialogue.
MultiWOZ / SGD / NL-DST — Dialogue State Tracking — turn-incremental, not whole-dialogue extraction.
LangChain Chat Extraction — Single-turn Q/A extraction — no multi-turn context.
ChatExtract (Nature Communications, 2024) — Materials-science extraction from papers — not customer-service dialogue.

The wedding planner: one case that separates the top four models from everyone else

Before the methodology, look at the single most discriminative case in the benchmark. Priya books a wedding-planner vision meeting and attends alone (her fiancé cannot make it), but the conversation contains a separate "40" referencing estimated wedding guest count. The ground truth for party_size is 1. Only 4 of 11 models get this right at all: gpt-5.4, gpt-4o, and gpt-5-reasoning in 3/3 repeats, and gemini-2.5-pro in 2/3. The other 7 latch onto the 40-guest distractor. What separates the four that solve it is not model size or reasoning; it is context-carrying disambiguation: remembering "just me for this meeting" across 20 turns of realistic conversation. case-10 is still a better differentiator than every intentional multi_ref case combined.

case-10 — the wedding planner

Conversation

user Hi — I am Priya. I wanted to book a vision meeting for a wedding planner. Just a first conversation to see if we are a fit.
assistant Wonderful, Priya — happy to help. How many of you will be joining the vision meeting?
user Just me this time. My fiancé really wanted to come but the timing does not work for him.
assistant No problem at all. Are you open to sharing any rough planning details — preferred date, estimated guest count?
user We are thinking May 2027 — probably around 40 people for the wedding itself.
assistant Got it. And the vision meeting itself — shall we target next week?
user Next Tuesday works. Can we do 2026-04-21 at 3pm?
assistant Confirmed — Tuesday 2026-04-21 at 3pm, with Priya attending solo. See you then.

Expected output

{
  "customer_name": "Priya",
  "party_size": 1,
  "preferred_date": "2026-04-21",
  "service_type": "consultation"
}

Per-model results

gpt-5.4: correct

{
  "customer_name": "Priya",
  "party_size": 1,
  "preferred_date": "2026-04-21",
  "service_type": "consultation"
}

3/3 repeats correct — carries the "just me" context across the 40-guest distractor.

gpt-4o: correct

{
  "customer_name": "Priya",
  "party_size": 1,
  "preferred_date": "2026-04-21",
  "service_type": "consultation"
}

3/3 repeats correct — solves the case cleanly without reasoning overhead.

gpt-5-reasoning: correct

{
  "customer_name": "Priya",
  "party_size": 1,
  "preferred_date": "2026-04-21",
  "service_type": "consultation"
}

3/3 repeats correct: one of three models that solve this case in every repeat.

gemini-2.5-pro: partial
```
{
  "customer_name": "Priya",
  "party_size": 1,
  "preferred_date": "2026-04-21",
  "service_type": "consultation"
}
```
2/3 repeats correct — latches onto "40" once in three runs. Only non-OpenAI model to solve it at all.

gpt-4o-mini: incorrect

{
  "customer_name": "Priya",
  "party_size": 40,
  "preferred_date": "2026-04-21",
  "service_type": "booking"
}

0/3 — picks the 40-guest number distractor; 7 of 11 models fail this case outright.

mistral-7b: incorrect

{
  "customer_name": "Priya",
  "party_size": 40,
  "preferred_date": "2026-04-21",
  "service_type": "booking"
}

0/3 — the case-10 failure mode is identical across every OSS + budget model.
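
To make the per-model verdicts concrete, here is a minimal sketch of per-field exact-match scoring applied to two of the outputs shown above. The helper name `field_matches` is hypothetical and is not taken from the benchmark's scripts/score.py; the two dictionaries are copied from the expected output and the gpt-4o-mini result.

```python
# Hypothetical helper for illustration; not the benchmark's own scorer.
FIELDS = ("customer_name", "party_size", "preferred_date", "service_type")

expected = {
    "customer_name": "Priya",
    "party_size": 1,
    "preferred_date": "2026-04-21",
    "service_type": "consultation",
}

# Output reported above for gpt-4o-mini on case-10.
gpt_4o_mini = {
    "customer_name": "Priya",
    "party_size": 40,
    "preferred_date": "2026-04-21",
    "service_type": "booking",
}


def field_matches(expected: dict, predicted: dict) -> dict:
    """Return {field: bool} using strict equality, one entry per schema field."""
    return {f: predicted.get(f) == expected[f] for f in FIELDS}


matches = field_matches(expected, gpt_4o_mini)
accuracy = sum(matches.values()) / len(FIELDS)
print(matches)
print(f"case-10 field accuracy: {accuracy:.0%}")
```

Under this reading, the failing output misses two fields at once: the party_size distractor and the booking-flavoured service_type.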

Methodology

44 synthetic conversations in English, 8–12 turns each (long cases 18–25 turns), partitioned into six difficulty categories designed to isolate distinct failure modes rather than measure flat difficulty.

Baseline (8 cases) — all 4 fields stated clearly, no distractors. Calibration floor.
Long (6) — 18–25 turns, smalltalk and off-topic asides. The biggest discriminator.
Multi-reference (10) — multiple candidate values; only one is final (3 names, 4 dates, 3 party sizes).
Self-correction (4) — user explicitly corrects a prior value. Retired from headline reporting (null signal).
Implicit (8) — a field is never stated but is inferable from context. Expanded from 2 to 8 in v1.1.
Deep ambiguity (8) — conditional dates, oscillating party sizes, keyword-vs-intent mismatches. New in v1.1.

Schema: 4 exact-match-scoreable fields — customer_name (text), party_size (integer), preferred_date (ISO 8601), service_type (5-way enum: consultation, quote, booking, follow_up, general_inquiry). Free-text and summary fields are deferred to v1.2.
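
The schema above can be sketched as a small validator. This is an illustration under stated assumptions, not the benchmark's code: the function name `validate_record` is hypothetical, and preferred_date is assumed to be a calendar date in YYYY-MM-DD form, which is how it appears in the case-10 example.

```python
from datetime import date

# The 5-way enum listed in the schema description.
SERVICE_TYPES = {"consultation", "quote", "booking", "follow_up", "general_inquiry"}


def validate_record(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []

    name = record.get("customer_name")
    if not isinstance(name, str) or not name.strip():
        errors.append("customer_name must be a non-empty string")

    size = record.get("party_size")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(size, int) or isinstance(size, bool):
        errors.append("party_size must be an integer")

    try:
        date.fromisoformat(record.get("preferred_date"))
    except (TypeError, ValueError):
        errors.append("preferred_date must be an ISO 8601 date string")

    service = record.get("service_type")
    if not isinstance(service, str) or service not in SERVICE_TYPES:
        errors.append("service_type must be one of the five enum values")

    return errors


case_10 = {
    "customer_name": "Priya",
    "party_size": 1,
    "preferred_date": "2026-04-21",
    "service_type": "consultation",
}
print(validate_record(case_10))
```

Validating before scoring keeps malformed output distinct from a well-formed but wrong answer, which is the distinction the "0 parse failures" figure relies on.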

Dataset released MIT-licensed at lab/benchmarks/conversational-extraction/cases/case-01…case-44.json.

Results

Rank	Model (provider)	Overall	customer_name	party_size	preferred_date	service_type	Std dev
#1	gpt-5.4 (OpenAI)	98.9	100	100	97.7	97.7	5.3
#2	gpt-4o (OpenAI)	98.3	100	100	100	93.2	6.4
#3	gemini-2.5-pro (Google)	97.9	100	99.2	98.5	93.9	6.3
#4	gpt-5-reasoning (OpenAI)	97.2	100	100	99.2	89.4	7.2
#5	o4-mini (OpenAI)	97.0	100	97.7	99.2	90.9	8.1
#6	gpt-5.4-mini (OpenAI)	96.8	100	97	100	90.2	8.1
#7	gpt-4o-mini (OpenAI)	95.1	100	97.7	96.2	86.4	9.9
#8	llama-3.1-8b (Meta, OSS)	94.3	100	97.7	97.7	81.8	10.6
#9	llama-3.2-3b (Meta, OSS)	93.2	97.7	97.7	95.5	81.8	11.3
#10	gpt-5.4-nano (OpenAI)	93.0	100	95.5	97.7	78.8	12.3
#11	mistral-7b (Mistral, OSS)	92.6	100	95.5	97.7	77.3	12.7
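
The Overall column appears consistent with an unweighted mean of the four per-field accuracies. The formula is not stated on this page, so the sketch below is an inference from the table rather than the scoring script's definition; the function name `overall` is hypothetical.

```python
# Per-field accuracies (customer_name, party_size, preferred_date, service_type)
# copied from four rows of the table above.
rows = {
    "gpt-5.4": (100, 100, 97.7, 97.7),
    "gpt-4o": (100, 100, 100, 93.2),
    "gemini-2.5-pro": (100, 99.2, 98.5, 93.9),
    "mistral-7b": (100, 95.5, 97.7, 77.3),
}

# Overall values as reported in the table.
reported = {"gpt-5.4": 98.9, "gpt-4o": 98.3, "gemini-2.5-pro": 97.9, "mistral-7b": 92.6}


def overall(fields: tuple) -> float:
    """Unweighted mean of the per-field accuracies (assumed formula)."""
    return sum(fields) / len(fields)


for model, fields in rows.items():
    print(f"{model}: computed {overall(fields):.2f}, reported {reported[model]}")
```

Because the per-field inputs are already rounded to one decimal, the computed values agree with the reported ones only to within rounding.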


Where differentiation lives: long-context retention

Overall accuracy converges within a 6.3 pp band — the real capability signal is the long category (16.7 pp gap, n=6). Implicit compresses to 6.3 pp once expanded to 8 cases; v1.0's 25 pp implicit headline was a single-case artifact.

Accuracy vs latency — the Pareto frontier

With real pricing now confirmed for every closed-source model, the frontier is defensible. gpt-4o sits on the frontier at 0.9 s / 98.3%; gpt-5.4 tops accuracy at 1.6 s / 98.9%. gpt-5-reasoning is a 5.6 s outlier strictly dominated by gemini-2.5-pro on every axis.
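
A dominance check of this kind can be sketched in a few lines. The sketch uses only the accuracy and latency figures quoted on this page and assumes they are all p50 values; cost is left out because prices are quoted here for only some models. The names `Point`, `dominates`, and `pareto_frontier` are hypothetical.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    accuracy: float   # percent, higher is better
    latency_s: float  # seconds, lower is better


points = [
    Point("gpt-4o", 98.3, 0.9),
    Point("gpt-5.4", 98.9, 1.6),
    Point("gemini-2.5-pro", 97.9, 4.985),
    Point("gpt-5-reasoning", 97.2, 5.6),
]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_s <= b.latency_s
    strictly_better = a.accuracy > b.accuracy or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_frontier(pts: list) -> list:
    """Keep only the points that no other point dominates."""
    return [p for p in pts if not any(dominates(q, p) for q in pts)]


print([p.model for p in pareto_frontier(points)])
```

On these two axes alone, gemini-2.5-pro is itself dominated by gpt-4o; it is the lower price quoted elsewhere on this page that keeps it on the three-axis frontier.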

What surprised us

A universal `service_type` bias

All 11 models share an identical signature failure: over-prediction of booking, under-prediction of general_inquiry and implicit follow_up. When customer intent is ambiguous, every model reaches for booking — the most common business-action label in service-business training data. The bias is structural, not stochastic: even gpt-5.4 drops 2.3 pp on service_type while hitting 100% on customer_name and party_size. Model capability shifts severity (gpt-5.4 at 97.7% → mistral-7b at 77.3%, a 20.4 pp spread) but never direction.

The v1.0 implicit headline was single-case noise

v1.0 reported a 25 pp gap on implicit inference as its headline finding. That gap was measured across 2 cases. In v1.1 we expanded implicit to 8 cases; the gap compressed to 6.3 pp (gpt-5.4 / gpt-4o / gpt-5.4-mini tied at 96.9% top; three models tied at 90.6% floor). The real, statistically durable capability signal is long-context retention (16.7 pp, n=6), not implicit inference. This is a vindication of statistical-robustness standards over dramatic but fragile findings.
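
A back-of-the-envelope check shows why two cases cannot carry a headline. The sketch assumes category accuracy is averaged over cases × 4 fields, which is an assumption, though one consistent with the 96.9% and 90.6% figures above (31/32 and 29/32). The function names are hypothetical.

```python
FIELDS_PER_CASE = 4  # customer_name, party_size, preferred_date, service_type


def step_pp(n_cases: int) -> float:
    """Accuracy change, in percentage points, caused by one wrong field."""
    return 100 / (n_cases * FIELDS_PER_CASE)


def gap_pp(field_misses: int, n_cases: int) -> float:
    """Gap between a perfect model and one with the given number of wrong fields."""
    return field_misses * step_pp(n_cases)


for n_cases in (2, 8):
    print(
        f"n={n_cases}: one miss = {step_pp(n_cases)} pp, "
        f"two misses = {gap_pp(2, n_cases)} pp"
    )
```

With n=2, two wrong fields already produce a 25 pp gap; with n=8 the same two misses produce 6.25 pp, which rounds to the 6.3 pp reported for v1.1. A single miss at n=8 is 3.125 pp, matching the 3.1 pp range quoted for deep_ambiguity.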

Reasoning is overhead, not an asset

gpt-5-reasoning and o4-mini are both outperformed by chat counterparts. gpt-5-reasoning burns roughly 306 reasoning tokens per call and is 1.7 pp less accurate overall than the same-generation chat model (97.2% vs 98.9% for gpt-5.4), 3.5× slower at p50, and, with real pricing now locked, strictly Pareto-dominated by gemini-2.5-pro. o4-mini sits on the frontier but is narrowly beaten by gemini-2.5-pro on accuracy for only a 4% cost premium. Interpretation: conversational extraction is pattern-matching where the answer is usually visible in the input. There is nothing to reason about: the customer says their name, you extract their name.

gemini-2.5-pro is legitimately competitive

On its first appearance in the benchmark, Gemini 2.5 Pro lands at #3 overall (97.9%), effectively tied with gpt-4o (98.3%) and only 1.0 pp behind gpt-5.4. It is the only model besides gpt-5.4, gpt-4o, and gpt-5.4-mini to crack 95% on implicit inference. Its weakness is long (90.3%) — still stronger than all budget chat models and OSS models, weaker than the three frontier chat models above it. Latency is on the high side (p50 4,985 ms) but cost ($2.00 / 1,000 calls) puts it mid-frontier. For buyers needing vendor diversity away from OpenAI, Gemini 2.5 Pro is a credible drop-in for gpt-4o-class extraction.

Two categories produce zero signal

Both self_correction (11-way tie at 100%, 4 cases) and deep_ambiguity (8-way tie at 100%, 8 cases, 3.1 pp total range) fail to discriminate the roster. The 8 deep_ambiguity cases authored for v1.1 — oscillating party sizes, conditional dates, keyword-vs-intent mismatches, nested self-corrections — were designed to test whether reasoning models ever justify their overhead. Most models solve these cleanly. Both categories should be retired in v1.2 and the case budget redirected to more long and implicit variants, which is where real capability gaps surface.

Surprise: case-35 breaks most frontier models

case-35 (implicit date: "the Monday after Memorial Day weekend") requires resolving a US holiday + applying the "after" rule to a specific weekend. 9 of 11 models fail it consistently, including gpt-5.4, gpt-5-reasoning, gemini-2.5-pro, o4-mini, and all three OSS models. Only gpt-5.4-mini and gpt-4o solved it. The failure is not calendar arithmetic; it's that models extract "Memorial Day" literally, or the "Monday" keyword matches a wrong anchor. Implicit temporal inference — even with a pinned reference date — remains brittle across model tiers.
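
For reference, the date arithmetic itself is short. The sketch below relies on Memorial Day being the last Monday of May and reads "the Monday after Memorial Day weekend" as one week after the holiday; that reading, the 2026 reference year, and the function names are assumptions, since the case's ground-truth value is not shown on this page.

```python
from datetime import date, timedelta


def memorial_day(year: int) -> date:
    """Last Monday of May in the given year."""
    last_of_may = date(year, 5, 31)
    # weekday() is 0 for Monday, so subtracting it steps back to the last Monday.
    return last_of_may - timedelta(days=last_of_may.weekday())


def monday_after_memorial_day_weekend(year: int) -> date:
    """Assumed reading: the Monday one week after Memorial Day itself."""
    return memorial_day(year) + timedelta(days=7)


print(memorial_day(2026).isoformat())
print(monday_after_memorial_day_weekend(2026).isoformat())
```

A deterministic resolver like this is one way to post-process relative date phrases instead of trusting the model's own calendar reasoning.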

Agreement with prior art

Qualitative ranking here (gpt-5.4 > gpt-4o ≈ gemini-2.5-pro > budget > OSS) tracks LLMStructBench's composite-score ordering for overlapping models on static extraction. Absolute numbers are higher than MultiWOZ/SGD Joint Goal Accuracy (60–70% for frontier models), reflecting task-shape differences — 4 fields versus slot ontologies, whole-dialogue versus turn-incremental — not model improvements. ChatExtract's finding that "follow-up questions improve accuracy" is consistent with our observation that models fail on implicit + long-context cases where a follow-up question in a real system would disambiguate.

Implications

Budget for service_type-class enum errors, not long-context errors. All 11 models over-predict "booking" — a counter-instruction in your system prompt is the single highest-leverage fix for production extraction pipelines.
Do not pay for reasoning on extraction tasks. gpt-5-reasoning is strictly Pareto-dominated by gemini-2.5-pro (same-tier accuracy, lower cost, lower latency on average). Reserve reasoning models for tasks with explicit deliberation requirements, not pattern matching.
For on-device or privacy-sensitive deployments, Llama 3.1 8B is the right open-source choice. 94.3% overall accuracy with zero ongoing cost and 1.6 s p50 latency on local hardware, at a meaningful but manageable 4 pp accuracy gap vs gpt-4o.
For vendor diversity, gemini-2.5-pro is a credible gpt-4o alternative. 97.9% vs 98.3% is within repeat variance; $2.00/1k vs $2.97/1k is a 33% cost saving. Use it if your architecture requires not being locked into a single API provider.
If cost is the binding constraint, gpt-4o-mini is still the rational default. 95.1% accuracy at $0.18 per 1,000 extractions — a 16.5× cost reduction versus gpt-4o for a 3.2 pp accuracy loss. No other priced model offers a better ratio in this band.
case-10 (the wedding planner) is a reusable capability probe. Run it first when qualifying a new model for dialogue extraction. A model that fails it cannot reliably carry context through 20 turns of smalltalk with a number distractor.
Two categories are no longer worth benchmarking. self_correction (11-way tie at 100%) and deep_ambiguity (8-way tie at 100%, 3.1 pp range) produce zero discriminative signal on this task shape. Future case budget should invest in long-context and implicit variants.
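
The cost trade-offs quoted in the list above can be reproduced with a short calculation. The prices and accuracies are the ones stated on this page; the function name `compare` and the choice of gpt-4o as the baseline are illustrative assumptions.

```python
# USD per 1,000 calls and overall accuracy (%), as quoted on this page.
price_per_1k = {"gpt-4o": 2.97, "gemini-2.5-pro": 2.00, "gpt-4o-mini": 0.18}
accuracy = {"gpt-4o": 98.3, "gemini-2.5-pro": 97.9, "gpt-4o-mini": 95.1}


def compare(candidate: str, baseline: str = "gpt-4o") -> dict:
    """Cost and accuracy of a candidate model relative to a baseline model."""
    return {
        "cost_ratio": price_per_1k[baseline] / price_per_1k[candidate],
        "saving_pct": 100 * (1 - price_per_1k[candidate] / price_per_1k[baseline]),
        "accuracy_loss_pp": accuracy[baseline] - accuracy[candidate],
    }


for model in ("gemini-2.5-pro", "gpt-4o-mini"):
    result = compare(model)
    print(
        f"{model}: {result['cost_ratio']:.1f}x cheaper, "
        f"{result['saving_pct']:.0f}% saving, "
        f"{result['accuracy_loss_pp']:.1f} pp accuracy loss"
    )
```

Swapping in your own call volume and the cost of a bad record turns the same arithmetic into a break-even estimate for human review.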

Reproduction

Methodology version: v1.1
Dataset version: v1.1

Run command

cd lab/benchmarks/conversational-extraction && pip install -r requirements.txt && python scripts/run.py --force && python scripts/score.py

Source of truth for every number on this page: lab/benchmarks/conversational-extraction/publications/conversational-extraction-v1.1.md. Expected wall-clock runtime on the full 1,452-call pipeline: ~40 minutes, bounded by gpt-5-reasoning and gemini-2.5-pro p95 latency. Expected cost (priced models only): ~$2.00 USD.

Limitations & future work

Limitations

Schema is narrow — 4 fields. customer_name ceilings at 100% across the roster; more fields would stress models harder.
English only. Service-inquiry chat in Spanish, French, German, Japanese is untested.
Synthetic data — 44 hand-authored cases, no real customer transcripts, no consented production data.
Two categories produce zero discriminative signal — self_correction (11-way tie at 100%) and deep_ambiguity (8-way tie at 100%, 3.1 pp total range).
Determinism caveat — temperature=0 with a pinned seed is near-deterministic but not bit-identical across serving-infrastructure drift.
Single prompt variant — every model evaluated with the same system prompt. Prompt-engineering sensitivity (especially for service_type bias) not explored.
Anthropic Claude Sonnet 4.6 not yet included — SDK integration complete and tested, but evaluation deferred to v1.2 pending API account provisioning.
Gemini uses native google-genai SDK, not OpenAI-compat — the compat endpoint failed with "Multiple authentication credentials" when gcloud CLI auth was active.

Future work (v1.2)

Add Claude Sonnet 4.6 (v1.2) — SDK already integrated and smoke-tested, just needs API credit. Closes the last frontier-vendor gap.
Retire self_correction and deep_ambiguity from scored categories — both produce null signals. Redirect case budget to long + implicit variants.
Rebuild long in the case-10 / case-35 style — embed factual distractors inside 20+-turn conversations. These are the cases that actually differentiate models.
Reduce repeats from 3 to 1 — temperature=0 is effectively deterministic per v1.0 and v1.1 evidence.
Add reasoning-effort sweeps — test whether medium/high reasoning surfaces gains on hard cases.
Expand schema to 6–8 fields (price_range, booking_time_of_day, contact_preference, location).
Multilingual coverage — Spanish, French, German, Japanese.
Add a service_type debias prompt variant — test whether the universal bias is a prompting artifact or a training artifact.

Version history

v1.1 (Apr 19, 2026)
- Added gemini-2.5-pro (Google) via native GenAI SDK — enters leaderboard at #3 (97.9%).
- Added 6 new implicit cases (expanded from 2 to 8): implicit-category gap compresses from 25 pp to 6.3 pp, confirming the v1.0 headline was a small-sample artifact.
- Added 8 new deep_ambiguity cases — 8-way tie at 100%, null finding.
- Pinned real pricing for every closed-source model — Pareto frontier now defensible.
- 11 models × 44 cases × 3 repeats = 1,452 runs with zero API or parse failures.
- Anthropic Claude Sonnet 4.6 code integrated but evaluation deferred to v1.2.
v1.0 (Apr 15, 2026)
- Internal draft — not publicly released.
- 10 models tested (7 OpenAI, 3 OSS via Ollama).
- 30 hand-authored multi-turn service-inquiry cases across 5 difficulty categories.
- 900 runs total (3 repeats per case × model) with zero API or parse failures.
- Flat per-field leaderboard — no composite metric, no LLM-as-judge.

Cite this publication

Plain text

Neomanex Research Lab (2026). Conversational Extraction: Which LLMs actually capture structured data from real dialogue? (Publication v1.1). Neomanex Research Lab. https://neomanex.com/lab/conversational-extraction-v1.1

BibTeX

@techreport{lab2026-conversational-extraction,
  author = {Neomanex Research Lab},
  title = {Conversational Extraction: Which LLMs actually capture structured data from real dialogue?},
  institution = {Neomanex Research Lab},
  year = {2026},
  version = {1.1},
  url = {https://neomanex.com/lab/conversational-extraction-v1.1}
}