AI Legal Research Tools Outperform Attorneys in 200-Question Benchmark Study

Artificial intelligence systems now match or exceed attorney accuracy in legal research, according to a new study that tested four AI tools against practicing lawyers on 200 legal questions.

AI Systems Surpass Lawyer Baseline in Legal Research

Released October 14, the VLAIR Legal Research Study compared four artificial intelligence products—Alexi, Counsel Stack, Midpage, and ChatGPT—with a control group of licensed U.S. lawyers. Each participant answered 200 questions spanning federal and state law. The study analyzed responses collected in July 2025, with results compiled and peer-reviewed over the subsequent months. The AI systems outperformed human lawyers across every scoring category, including accuracy, sourcing, and readability.

AI scores ranged from 74 to 78 percent on a weighted scale, while the Lawyer Baseline averaged 69 percent. Counsel Stack led the field overall. Even ChatGPT, a general-purpose model rather than a dedicated legal tool, scored higher than the human group. The report concludes that “AI can now provide highly accurate and properly cited responses to standard legal research questions.”

A Benchmark Built for Direct Comparison

The VLAIR team developed its benchmark in consultation with six consortium law firms, including Reed Smith, McDermott Will & Emery, and Paul Weiss. The firms provided model questions, reference answers, and authoritative citations drawn from real client work. Independent evaluators from Vanderbilt University’s AI Law Lab, BYU Law, and LegalBenchmarks.ai scored responses without knowing their source.

Each entry was graded for accuracy, authoritativeness, and appropriateness, weighted at 50, 40, and 10 percent, respectively. AI products received identical zero-shot prompts through their APIs and were required to locate and cite relevant law autonomously. Each question was submitted three times to each AI system to minimize output variance. Human lawyers answered the same questions using standard databases such as Westlaw or LexisNexis but without any generative assistance.
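
To make the weighting concrete, here is a minimal sketch of how such a composite score could be computed, assuming each criterion is graded on a 0 to 100 scale; the report does not publish its exact rubric, so the numbers are hypothetical:

```python
# Illustrative sketch of the 50/40/10 weighting described above.
# Assumes each criterion is scored on a 0-100 scale; the study's actual
# rubric is not published, so these values are hypothetical.

WEIGHTS = {"accuracy": 0.50, "authoritativeness": 0.40, "appropriateness": 0.10}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores into a single weighted score."""
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)

# Example: 80 on accuracy, 70 on authoritativeness, 90 on appropriateness
# yields 0.5*80 + 0.4*70 + 0.1*90 = 77.
print(weighted_score({"accuracy": 80, "authoritativeness": 70, "appropriateness": 90}))
```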

Accuracy: Machines Match Human Judgment

Accuracy was the strongest metric across all participants. Legal AI tools averaged 80 percent accuracy, nearly identical to ChatGPT’s performance. Evaluators awarded partial credit when systems acknowledged missing data or explained their reasoning. The report noted that ChatGPT frequently recognized when case law was unavailable, yet still produced legally sound interpretations based on its broader knowledge base—a significant finding about how generalist AI differs from specialist tools.

Lawyers performed consistently but lost points for omission or brevity. Their responses were more concise, often resembling client-ready summaries, but lacked the depth of the reference answers used for scoring. “The lawyers were correct more often than complete,” the report observed, describing the difference as stylistic rather than substantive: the answers were accurate but incomplete, prioritizing practical utility over comprehensive detail.

Citations: Legal AI Retains Its Edge

Authoritativeness proved to be the key differentiator. Legal AI systems outscored both ChatGPT and the lawyers by an average of six points. Their advantage came from structured databases that restrict output to verified judicial and statutory sources; unlike ChatGPT, which draws from the open web, the legal-specific platforms cite only materials that can be checked against established legal repositories. Lawyers, by contrast, sometimes cited persuasive but nonbinding materials or secondary summaries.
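
As a purely illustrative sketch of that retrieval constraint, and not the implementation of any product named in the study, filtering candidate citations against a curated allow-list might look like this:

```python
# Purely illustrative sketch of citation filtering against a curated allow-list.
# None of the products in the study publish their retrieval code; this only
# shows the general idea of restricting output to verifiable sources.

VERIFIED_SOURCES = {"U.S. Reports", "Federal Reporter", "United States Code"}

def filter_citations(citations: list[dict]) -> list[dict]:
    """Keep only citations whose reporter or code appears on the curated list."""
    return [c for c in citations if c["source"] in VERIFIED_SOURCES]

candidates = [
    {"cite": "Ashcroft v. Iqbal, 556 U.S. 662 (2009)", "source": "U.S. Reports"},
    {"cite": "A law-firm blog post summarizing Iqbal", "source": "open web"},
]
print(filter_citations(candidates))  # only the verifiable citation survives
```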

Legal AI systems also experienced occasional technical failures. Counsel Stack timed out on four questions and Midpage on three, providing no response at all. In other instances, systems acknowledged they could not locate sufficient sources but still offered partial responses or explanations, earning some credit from evaluators. These technical issues were rare across the 200-question dataset but highlight remaining reliability concerns.

ChatGPT’s open web access helped on time-sensitive questions involving recently enacted laws or guidance. However, the report noted that specialized products performed better overall because their sources could be validated.

Jurisdictional Complexity Still a Challenge

Multi-jurisdiction research proved difficult for both humans and machines. The study included 14 questions requiring review of more than one state’s laws. Scores on these jurisdictionally complex questions averaged 14 points lower than on single-jurisdiction questions. A single 50-state survey question in the dataset caused the most failures across all participants. Counsel Stack timed out, Alexi offered incomplete results, and ChatGPT and the lawyers both cited only a limited sample of states rather than comprehensive primary sources. The report suggests that workflow tools tailored for multi-state surveys would significantly improve results.

When AI Outperforms Lawyers

AI products exceeded the Lawyer Baseline on 75 percent of the 200 questions. Where they did, their margin of superiority averaged 31 percentage points. Evaluators found AI particularly effective at identifying relevant law quickly and citing it in full. AI responses were longer but more structured, often including analysis, statutory text, and supporting authority within a single output.
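
To illustrate how those two figures relate, the sketch below computes a win rate and an average margin of superiority from paired per-question scores; the numbers are invented, not VLAIR data:

```python
# Hypothetical illustration of the win-rate and average-margin statistics
# reported above; the per-question scores below are invented, not VLAIR data.

ai_scores =     [82, 74, 90, 65, 88]   # weighted scores for an AI tool
lawyer_scores = [70, 78, 55, 60, 80]   # weighted scores for the Lawyer Baseline

wins = [(ai, lw) for ai, lw in zip(ai_scores, lawyer_scores) if ai > lw]
win_rate = len(wins) / len(ai_scores)                     # share of questions the AI won
avg_margin = sum(ai - lw for ai, lw in wins) / len(wins)  # margin on those questions only

print(f"AI won {win_rate:.0%} of questions by an average of {avg_margin:.1f} points")
```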

Lawyers, meanwhile, retained an advantage in areas requiring deeper contextual understanding. They outperformed AI in four of ten question categories: interpretation of ambiguous facts, complex multi-jurisdictional synthesis, judgment-based analysis, and contextual reasoning. Human responses also received fewer zero scores, reflecting their tendency to provide at least partial answers even when uncertain. Where lawyers outperformed AI on individual questions, they did so by an average margin of nine percentage points.

Speed and Length in Comparison

Response speed varied by system. Counsel Stack and Midpage were the fastest, while ChatGPT produced the longest and most detailed replies. The study measured latency and word count to estimate productivity, finding that longer outputs often correlated with higher accuracy. Lawyers did not record response times, but the report notes that AI systems returned answers in seconds rather than hours.
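
For readers curious how latency and length metrics like these are typically captured, a minimal sketch follows; the query function is a hypothetical stand-in, since the study’s benchmarking harness is not published:

```python
# Minimal sketch of measuring latency and word count per response.
# `query_legal_ai` is a hypothetical placeholder for whatever API call a
# benchmarking harness would make; VLAIR has not published its code.
import time

def query_legal_ai(question: str) -> str:
    # Placeholder: a real harness would call the vendor's API here.
    return "Sample answer citing 28 U.S.C. § 1331 ..."

def measure(question: str) -> tuple[float, int]:
    start = time.perf_counter()
    answer = query_legal_ai(question)
    latency = time.perf_counter() - start   # seconds to return a response
    word_count = len(answer.split())        # rough length of the output
    return latency, word_count

print(measure("Does federal question jurisdiction apply here?"))
```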

Study Period and Methodology

Responses were collected between July 1 and July 21, 2025. Evaluators assessed every answer against a reference key provided by the consortium firms, and zero scores were reviewed by a third evaluator to ensure fairness. Scores were averaged across the three runs submitted to each AI system and weighted to produce overall results.
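
A simplified sketch of that aggregation step, with invented numbers rather than VLAIR data, could look like this:

```python
# Simplified sketch of the aggregation described above: each AI system's three
# runs per question are averaged, and any run scored zero is flagged for review
# by a third evaluator. All numbers here are invented for illustration.
from statistics import mean

runs_per_question = {
    "Q1": [78.0, 81.0, 76.0],
    "Q2": [0.0, 65.0, 62.0],   # e.g. one timed-out attempt scored zero
}

for question, runs in runs_per_question.items():
    needs_review = any(score == 0 for score in runs)  # zero scores get a second look
    status = "flag for third evaluator" if needs_review else "ok"
    print(question, round(mean(runs), 1), status)
```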

The study followed the same methodology used in the first VLAIR report released in February but separated legal research into its own category to allow deeper analysis. Legal research remains a contested use case for generative models because of prior incidents involving fabricated citations. VLAIR’s approach aimed to measure capability without influence from user prompting or iterative refinement.

Limits and Considerations

The authors acknowledged that no research study can capture the full nuance of legal reasoning. Zero-shot prompts do not reflect the iterative questioning typical of real practice, and differences in database coverage may have affected results. In two questions, jurisdictional references were found to be inaccurate after testing, and those questions were removed. Despite these constraints, evaluators said the findings are statistically significant and reproducible.

VLAIR emphasized that AI tools are still dependent on lawyer oversight for final interpretation. The report warns against assuming that automated accuracy equates to professional competence. It concludes that while AI systems can now produce reliable first drafts of legal research, accountability for their use remains with the supervising attorney.

Broader Context: Regulation and Adoption

The report arrives as several U.S. courts and bar associations reconsider rules governing AI use in legal practice. In 2025, multiple federal districts introduced disclosure requirements for filings assisted by generative systems. The American Bar Association’s Formal Opinion 512 also clarified that competence includes understanding the capabilities and risks of AI tools. VLAIR’s findings are likely to inform these policy discussions by providing data on how such systems actually perform.

Industry adoption continues to accelerate. Law firms have begun integrating legal AI into research and drafting workflows, often pairing the technology with human review. Corporate legal departments are also exploring AI for internal compliance and contract review. VLAIR’s data suggests that these applications may now be reaching reliability thresholds suitable for professional deployment.

A New Definition of Competence

The study’s authors conclude that the line between research automation and professional judgment is narrowing. They describe the results as evidence that legal AI has entered a new phase of maturity. “The technology is no longer experimental,” the report states. “The remaining question is how lawyers integrate it responsibly into their work.”

For a profession built on precedent, the implications are significant. Accuracy and citation discipline are no longer exclusive to human researchers. With empirical benchmarks now available, competence in legal research may increasingly mean the ability to direct, verify, and supervise intelligent systems that can already find the law on their own.

This article was prepared for educational and informational purposes only. It does not constitute legal advice and should not be relied upon as such. All study data are publicly available through Vals AI. Readers should consult professional counsel for specific questions regarding AI use in legal practice.
