MeshWorld.

How to Evaluate LLMs for Enterprise Use: Beyond Benchmarks

By Maya

The MMLU benchmark score tells you nothing useful about whether a model will perform on your support tickets, your legal documents, or your internal knowledge base. Enterprise LLM evaluation is a separate discipline from academic benchmarking — it starts with your data, your failure modes, and your business constraints. Here’s how to do it properly.

:::note[TL;DR]

  • Public benchmarks measure general capability; you need to measure performance on your specific task and data
  • Build an evaluation set from 100–200 real examples drawn from your actual workload
  • Measure: task accuracy, hallucination rate, p95 latency, and cost per unit of work
  • RAG when your data changes frequently or is too large to fine-tune; fine-tuning when you need consistent format/style or domain vocabulary
  • Vendor lock-in and data residency are real constraints — map them before procurement, not after

:::

Why do benchmarks mislead enterprise buyers?

MMLU, HellaSwag, HumanEval, and similar benchmarks were designed to measure broad general intelligence across many domains. They’re useful for researchers comparing models at the capability frontier. They’re not useful for answering “will this model correctly classify our customer complaints?” or “does it understand our internal terminology?”

The mismatch is fundamental. Benchmark tasks are designed to be unambiguous and generalizable. Enterprise tasks are domain-specific, often ambiguous, and measured against your organization’s definition of correct — which may differ from the benchmark’s. A model that scores 90% on MMLU and 65% on your evaluation set is not performing at 90% for your use case. The 65% is the number that matters.

The second issue: benchmark conditions don’t reflect your deployment conditions. Latency, cost per call, context window utilization with real documents, and failure behavior under unusual inputs aren’t in the benchmark.


How do you build an evaluation set?

An evaluation set is a collection of input → expected output pairs drawn from your real workload. It’s the foundation of everything else.

Size: 100–200 examples is enough to start. More is better, but a small high-quality set is worth more than a large set with ambiguous labels.

How to collect examples:

  • Pull 200 recent examples of the task you’re automating. For a support classification agent, that’s 200 recent support tickets with their correct category (labeled by your team). For a document summarization agent, that’s 200 documents with summaries your team would consider correct.
  • Include edge cases you’ve seen in production. If 10% of your tickets are in Spanish and the agent needs to handle them, make sure 10% of your eval set is Spanish tickets.
  • Include deliberate failure cases — inputs designed to make the model fail (unusual phrasing, missing information, adversarial content). You want to know failure modes before deployment.
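Concretely, a flat JSONL file is enough to start: one labeled example per line. A minimal sketch (the field names here are illustrative, not a standard):

```python
import json

# Hypothetical eval-set layout: one labeled example per JSONL line.
# "tags" lets you slice results later (language, edge case, adversarial).
examples = [
    {"input": "My invoice shows a double charge", "expected": "billing issue", "tags": ["en"]},
    {"input": "No puedo iniciar sesión en mi cuenta", "expected": "login issue", "tags": ["es"]},
    {"input": "We need this fixed asap!!!", "expected": "needs human review", "tags": ["adversarial"]},
]

# Serialize to JSONL (write this string to e.g. eval_set.jsonl).
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
print(jsonl.count("\n") + 1, "examples serialized")  # → 3 examples serialized
```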

Labeling: Human labels are ground truth. For tasks with subjective output (summaries, generated text), define a rubric with concrete criteria. “Good summary” is not a criterion. “Captures all action items mentioned in the document and is under 100 words” is a criterion.


What should you actually measure?

Task accuracy. For classification tasks: percentage of correct labels. For extraction tasks: percentage of required fields correctly extracted. For generation tasks: percentage passing your rubric.

A useful way to report it: per-category accuracy, not just overall. An agent that’s 95% accurate overall but 40% accurate on the “billing issue” category — your most critical ticket type — is not a success.
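Per-category accuracy is a few lines of code once an eval run has produced (expected, predicted) pairs. A minimal sketch:

```python
from collections import defaultdict

def per_category_accuracy(results):
    """results: list of (expected_label, predicted_label) pairs from an eval run."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in results:
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}

# Illustrative results: the overall number hides a miss on the critical category.
results = [
    ("billing issue", "billing issue"),
    ("billing issue", "refund request"),
    ("login issue", "login issue"),
]
print(per_category_accuracy(results))  # → {'billing issue': 0.5, 'login issue': 1.0}
```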

Hallucination rate. For any task where the model generates factual claims (summaries, answers from documents, data extraction), measure how often it generates information not present in the source. This requires human review of a sample. Even a 2% hallucination rate is significant if the hallucinated content is used in customer-facing output.

A simple measurement: randomly sample 50 generation outputs, manually check each for unsupported claims, report as percentage.
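The sampling step can be made reproducible so different reviewers check the same outputs. A sketch:

```python
import random

def sample_for_review(outputs, k=50, seed=0):
    """Draw a reproducible random sample of outputs for manual fact-checking."""
    rng = random.Random(seed)  # fixed seed → same sample every run
    return rng.sample(outputs, min(k, len(outputs)))

def hallucination_rate(reviewed):
    """reviewed: booleans, True if the reviewer found an unsupported claim."""
    return sum(reviewed) / len(reviewed)

# e.g. a reviewer flagged 1 of the 50 sampled outputs
print(hallucination_rate([True] + [False] * 49))  # → 0.02
```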

Latency. Measure p50 (median) and p95 (95th percentile). Your p95 is what one request in twenty experiences, and for interactive products those slow responses are the ones users remember. Measure under realistic load conditions, not idle benchmarks. A model that returns in 800ms median but 6 seconds at p95 will cause UX problems.
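Both numbers fall out of raw call timings with a nearest-rank percentile. A sketch (the timings are illustrative):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Illustrative per-call latencies in milliseconds; note the one slow outlier.
latencies_ms = [640, 710, 820, 790, 880, 950, 5600, 760, 700, 840]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # → 790 5600
```

The median looks healthy here; the p95 surfaces the outlier that a mean or median would hide.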

Cost per unit. Input tokens × input price per 1K, plus output tokens × output price per 1K (most providers price the two differently). Measure this on your actual eval set with real context lengths, not the model provider’s calculator example, which assumes short inputs. Long system prompts, retrieval context, and conversation history add up faster than expected.
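A sketch of the arithmetic, with hypothetical per-1K-token prices; substitute your provider's current rates:

```python
# Hypothetical prices per 1K tokens; check your provider's current pricing.
PRICE_IN_PER_1K = 0.003
PRICE_OUT_PER_1K = 0.015

def call_cost(input_tokens, output_tokens):
    """Cost of one API call: input and output tokens are priced separately."""
    return (input_tokens / 1000) * PRICE_IN_PER_1K + (output_tokens / 1000) * PRICE_OUT_PER_1K

# A realistic call: long system prompt plus retrieval context, short answer.
cost = call_cost(input_tokens=6500, output_tokens=300)
print(f"${cost:.4f} per call, ${cost * 1000:.2f} per 1K tasks")  # → $0.0240 per call, $24.00 per 1K tasks
```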


RAG vs fine-tuning vs prompt engineering — how do you choose?

These are the three main customization approaches, and the right choice depends on your situation.

Prompt engineering first. Always try this before anything else. A well-engineered prompt with clear instructions, examples, and constraints solves 60–70% of enterprise customization needs with zero additional infrastructure. If your task works with prompt engineering alone, stop there.

RAG (Retrieval-Augmented Generation) when:

  • Your knowledge base is large (thousands of documents) or changes frequently
  • Users need answers grounded in specific, up-to-date documents
  • You need citation or source attribution
  • You can’t or don’t want to fine-tune (cost, data security, expertise)

RAG connects the model to a vector database. At query time, relevant chunks are retrieved and included in the prompt context. The model answers based on the retrieved content rather than (only) its training data. RAG reduces hallucination on domain-specific topics because the correct information is in the context.
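A minimal sketch of the prompt-assembly step, assuming retrieval has already returned (source_id, text) chunks from a vector search (the instruction wording and chunk format here are illustrative):

```python
def build_rag_prompt(question, retrieved_chunks, max_chunks=5):
    """Assemble a grounded prompt: retrieved sources first, then the question.

    retrieved_chunks: list of (source_id, text) pairs from a vector search,
    assumed to be sorted by relevance.
    """
    context = "\n\n".join(
        f"[{source}] {text}" for source, text in retrieved_chunks[:max_chunks]
    )
    return (
        "Answer using ONLY the sources below. Cite the [source] id for each claim. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "What is the refund window?",
    [("policy-v3", "Refunds are accepted within 30 days of purchase.")],
)
```

The "say so if it's not in the sources" instruction is what turns retrieval into a hallucination control: the model has an explicit escape hatch instead of an incentive to guess.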

Fine-tuning when:

  • You need consistent output format or structure that’s hard to enforce with prompting
  • You have domain-specific vocabulary, abbreviations, or terminology the base model gets wrong
  • You have 1,000+ high-quality labeled examples and the budget for training runs
  • Latency is critical and a smaller fine-tuned model can match a larger base model’s quality on your task

Fine-tuning doesn’t add knowledge (use RAG for that) — it adjusts behavior, style, and format. A common mistake: fine-tuning to inject knowledge (e.g., product documentation), which works poorly and requires retraining every time the docs change. Use RAG for knowledge, fine-tuning for behavior.


What compliance requirements do you need to map?

Before selecting a vendor or model, get answers to these questions. Discovering a blocker during procurement is expensive; discovering it post-deployment is worse.

Data residency. Where does your data go when you call the API? Is it processed in the EU, US, or both? If you’re in healthcare, finance, or a regulated industry in the EU, GDPR requires knowing where personal data is processed. Some providers offer regional API endpoints (Azure OpenAI’s EU regions, for example) to address this.

Data retention. Does the provider store your prompts and completions? For how long? Is it used for training? Most enterprise plans allow opting out of training data retention, but you need to confirm this explicitly and get it in writing.

SOC 2 and ISO 27001. Are these certifications current? Request the audit reports, not just a badge on the website. Check the report date — a SOC 2 from 2023 doesn’t cover 2026 deployment.

HIPAA Business Associate Agreement (BAA). If you’re in healthcare, you need a BAA with your AI provider before processing any PHI. Not all providers offer this; confirm before selecting.

Audit logging. Can you export a log of all API calls including prompts and responses? Some industries require this for compliance review. Build this into your evaluation criteria.


How do you compare multiple models objectively?

Run all candidate models against the same evaluation set under the same conditions. Never compare models using different prompts — the prompt matters as much as the model for most tasks.

A comparison matrix:

| Model | Task accuracy | Hallucination rate | p95 latency | Cost/1K tasks |
| --- | --- | --- | --- | --- |
| Model A | 87% | 1.2% | 2.1s | $3.20 |
| Model B | 91% | 0.8% | 4.8s | $12.40 |
| Model C | 82% | 3.1% | 0.9s | $0.80 |

The right choice depends on your priorities. If latency and cost matter more than marginal accuracy, Model C might be right despite lower accuracy. If hallucination is unacceptable (medical, legal, financial), Model B’s 0.8% rate is worth the cost premium. There is no universally correct answer — the matrix makes the tradeoffs explicit.
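One way to make those priorities explicit is to normalize each column and apply weights. A sketch using the matrix above, with hypothetical weights reflecting one team's priorities:

```python
# Hypothetical weights; set these to reflect YOUR priorities, not these defaults.
weights = {"accuracy": 0.4, "halluc": 0.3, "latency": 0.2, "cost": 0.1}

models = {
    "Model A": {"accuracy": 0.87, "halluc": 0.012, "latency": 2.1, "cost": 3.20},
    "Model B": {"accuracy": 0.91, "halluc": 0.008, "latency": 4.8, "cost": 12.40},
    "Model C": {"accuracy": 0.82, "halluc": 0.031, "latency": 0.9, "cost": 0.80},
}

def normalize(values, higher_is_better):
    """Min-max normalize a {model: value} column to [0, 1], best = 1."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {
        m: ((v - lo) / span if higher_is_better else (hi - v) / span)
        for m, v in values.items()
    }

norm = {
    metric: normalize({m: s[metric] for m, s in models.items()}, metric == "accuracy")
    for metric in weights
}
scores = {m: sum(weights[k] * norm[k][m] for k in weights) for m in models}
```

With accuracy and hallucination weighted heavily, Model B comes out ahead; shift the weight toward latency and cost and Model C overtakes it. The point is not the formula but that the weighting forces the priority discussion to happen explicitly.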


Summary

  • Build an evaluation set from 100–200 real examples before comparing any models
  • Measure task accuracy per category, hallucination rate (sampled), p95 latency, and cost per unit
  • Prompt engineering first; RAG for large/changing knowledge bases; fine-tuning for consistent format and domain vocabulary
  • Map data residency, retention, SOC 2, and BAA requirements before selecting a vendor
  • Compare models on the same evaluation set with the same prompts — different prompts make comparisons meaningless

FAQ

How often should we re-run evaluations after deployment?

Monthly at minimum; weekly for high-stakes tasks. Model providers update models without always announcing behavioral changes. Your data distribution also shifts over time — new product features, seasonal ticket patterns, updated policies. A monthly eval run against a stable test set catches regressions before users notice them.

What sample size do we need for statistical significance?

For binary outcomes (correct/incorrect), 100 examples gives you a margin of error of about ±10 percentage points at 95% confidence. 200 examples gives ±7 points. For reporting accuracy differences between models, you need at least 200 examples to reliably detect a 5-point accuracy difference. Below 50 examples, results are more noise than signal.
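Those figures come from the standard normal-approximation margin of error for a proportion, z * sqrt(p(1-p)/n), taken at the worst case p = 0.5. A quick check:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% CI half-width for a proportion; p=0.5 is the worst case."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (50, 100, 200):
    print(n, f"±{margin_of_error(n) * 100:.1f} pts")
# n=100 gives roughly ±9.8 points and n=200 roughly ±6.9, matching the
# "about ±10" and "±7" figures above.
```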

We have limited labeled data. Can we still build an evaluation set?

Yes. Start with 50 examples if that’s what you have — a small evaluation is much better than none. Focus on getting the labeling criteria right. Then add to the set incrementally as more labeled data becomes available. Some tasks can use LLM-assisted labeling (a stronger model evaluates the outputs of the model under test), but this introduces its own biases — use it as a supplement to human labels, not a replacement.