AI Is Becoming a Go-To for Data Questions. How Reliable Are the Answers?
Erika Tyagi, Kristin Blagg, Emily Gutierrez

Tools like ChatGPT, Claude, and Gemini are quickly becoming a new “front door” to information. For those working to democratize access to high-quality data, this is a real opportunity: Large language models (LLMs) can help people more easily derive insights from public data through tools they’re already using.

But how well do LLMs handle questions grounded in public data? To find out, we tested them.

We curated 100 questions across 10 education and workforce topics, written from the perspectives of parents, advocates, researchers, data analysts, Capitol Hill staffers, and education or career advisors, and posed them to three LLMs (GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Flash). Each prompt was sent as a standalone query with no custom system prompt, conversation history, or tool access beyond what each model provides by default, approximating the out-of-the-box experience a typical user would have had in early 2026.
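
For readers who want to replicate this setup, here is a minimal sketch of the standalone-query harness, using the OpenAI Python SDK as one example. The model ID, file names, and column names are placeholders rather than our exact configuration, and the same pattern applies to the Anthropic and Google SDKs.

```python
# Minimal sketch of the standalone-query setup: each question is sent as a
# fresh, single-turn conversation with no system prompt, history, or extra
# tool configuration. The model ID and file layout are placeholders.
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_standalone(question: str, model: str = "gpt-5.2") -> str:
    """Send one question as a standalone, single-turn query."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# questions.csv is assumed to have persona, topic, and question columns
with open("questions.csv", newline="") as f, \
     open("responses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["persona", "topic", "question", "response"])
    for row in csv.DictReader(f):
        writer.writerow([row["persona"], row["topic"], row["question"],
                         ask_standalone(row["question"])])
```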

From this exercise, we found four significant limitations in current LLM performance:

  • Models explained concepts well but struggled to retrieve and use specific data.
  • Incorrect information was difficult to detect.
  • Models answered different questions than the ones asked.
  • Pointing models to the right sources and tools didn’t improve results.

These findings aren’t unique to education and workforce pathway data. Any organization that stewards public data—such as integrated data systems, federal statistical programs, or city open data portals—likely faces similar gaps. As AI reshapes how people find and use public information, the quality of what these systems deliver depends on the data infrastructure underneath them. Ensuring the infrastructure is AI ready will shape whether AI expands access to trustworthy public data or erodes it.

How well do LLMs handle questions grounded in public data?

Our evaluation of how LLMs respond to queries whose answers depend on public education and workforce data surfaced four concerns about the accuracy and usefulness of their responses.

  1. Models explained concepts well but struggled to retrieve and use specific data.

    When asked to define a term or broadly describe where to find data, models often produced accurate, useful responses. But the quality dropped sharply when questions required retrieving specific data, like enrollment at a particular school or earnings for graduates of a specific program. The quality dropped further still for tasks that required generating code or artifacts.

    In our evaluation, 17 percent of responses to general questions contained critical accuracy issues, compared with 58 percent for questions about specific schools or institutions and 79 percent for prompts asking models to generate code.

    Models tended to draw on secondary reports rather than underlying datasets. These reports were often sufficient for broad questions but led to outdated answers when questions required more recent data, and they lacked the specificity needed to answer questions about particular places, institutions, or time periods. The models rarely accessed the underlying data tools directly.
  2. Incorrect information was difficult to detect.

    The models presented inaccurate statistics with the same precision and authority as correct ones: even incorrect figures were attributed to named data sources, displayed in structured tables, and stated with confidence. Citations frequently pointed to content that did not support the claims attributed to them, and responses that included citations were no more likely to be accurate than those without.

    LLMs routinely advise users to verify the information they provide, but verification relies on citations that are specific enough to trace. The citations in our evaluation frequently were not: More than 60 percent contained citation issues our evaluators flagged as significant or critical. Our subject-matter experts often spent significant time trying to trace claims back to their sources. For users without domain expertise, this verification would be substantially more time-consuming and, in many cases, infeasible.
  3. Models answered different questions than the ones asked.

    Rather than flagging uncertainty, models tended to substitute general knowledge for specific answers. When asked about enrollment at Northern Virginia Community College, all three models reported it was declining, consistent with the national trend but the opposite of the college’s actual trajectory. Models also accepted false premises embedded in prompts: a question about declining enrollment at a Dallas elementary school produced confident explanations for the supposed decline, even though the school’s enrollment had been roughly stable.
  4. Pointing models to the right sources and tools didn’t improve results.

    When prompts explicitly directed models to use Urban’s Education Data Portal or Education-to-Workforce Framework Data Tool, the models recognized the sources but rarely accessed them correctly. Instead, the models fabricated datasets, cited nonexistent package names, and reported figures that didn’t come from the sources they named.

    Of the responses to prompts that directed models to use Urban’s tools and asked for code, more than 80 percent contained issues our evaluators classified as critical. Urban’s data tools offer a strong foundation for answering the kinds of questions we tested, but only if models can access them correctly.

What these findings mean for data providers

These findings don’t necessarily mean LLMs shouldn’t be used for tasks involving public data, but they do call into question the reliability of answers obtained through simple chat interactions without additional configuration. Closing that gap will require changes to the data infrastructure models draw on.

Most public data systems were built to serve human users navigating websites and documentation, not to be discovered and used by AI systems. When that gap goes unaddressed, models often return answers that look authoritative but aren’t. By attributing incorrect information to specific tools, models risk eroding trust in the sources that do have the right data.

The patterns from our evaluation point toward specific investments that we’re now applying to the Education Data Portal and Education-to-Workforce Framework Data Tool. This work is part of a growing effort across data providers to make public data more AI ready, informed by emerging work from federal statistical agencies, city governments, and cross-domain data platforms.

  • Test how AI handles your data. Before investing in solutions, understand where the gaps in data retrieval are. We encourage data providers to curate prompts grounded in their users’ questions, have subject-matter experts score the results, and let the patterns inform where to focus (see the scoring sketch after this list).
  • Invest in AI discoverability, not just accessibility. Data being accessible to humans does not mean it’s discoverable or usable by AI systems. We’re developing Model Context Protocol (MCP) servers that let AI systems query our tools directly (see the server sketch after this list). This approach raised retrieval accuracy from near 0 to 95 percent in a recent federal pilot. However, it requires users to connect to an AI tool through a compatible client, which is not how most people use these tools. To address this gap, we’re also working to improve how our data surfaces through the search and indexing channels that general-purpose AI systems rely on by default.
  • Enrich your metadata and documentation. Many of the failures we observed stemmed from missing context that experts have but that isn’t documented where models can easily find it. For example, not all states are reported in Post-Secondary Employment Outcomes data from the US Census Bureau. By embedding this information directly into the data and metadata layers, we can help models understand which comparisons are appropriate, how a variable is defined, which years are covered, and what the known limitations are (see the metadata sketch after this list).
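
To make the first recommendation concrete, here is a hypothetical sketch of the scoring step: subject-matter experts rate each response, and simple aggregation reveals where failures cluster. The column names and severity scale are illustrative, not a standard rubric.

```python
# Hypothetical pattern analysis over expert scores: one row per scored
# response, with accuracy_severity of none, minor, significant, or critical.
import pandas as pd

scores = pd.read_csv("expert_scores.csv")  # columns: model, task_type, accuracy_severity

# Share of responses with critical accuracy issues, by task type and model
critical_rate = (
    scores.assign(is_critical=scores["accuracy_severity"].eq("critical"))
    .groupby(["task_type", "model"])["is_critical"]
    .mean()
    .mul(100)
    .round(1)
)
print(critical_rate)
```

In our evaluation, this kind of breakdown is what surfaced the gap between general questions (17 percent with critical issues) and code-generation prompts (79 percent).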
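For the second recommendation, here is a minimal sketch of what an MCP server for a data tool can look like, using the official MCP Python SDK. It exposes a single hypothetical retrieval tool; the Education Data Portal endpoint path is illustrative, so consult the API documentation for exact routes.

```python
# Minimal MCP server exposing one retrieval tool over stdio.
# Requires the official MCP Python SDK (pip install mcp) and requests.
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("education-data")

@mcp.tool()
def get_school_enrollment(ncessch: str, year: int) -> dict:
    """Fetch enrollment records for one school (by NCES ID) and year.
    The endpoint path below is illustrative; check the API docs."""
    url = f"https://educationdata.urban.org/api/v1/schools/ccd/enrollment/{year}/"
    response = requests.get(url, params={"ncessch": ncessch}, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    mcp.run()  # stdio transport, for a compatible MCP client to connect to
```

Because the model calls a typed tool instead of guessing at URLs or package names, it can no longer fabricate a dataset: it either gets real records back or an explicit error.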
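And for the third recommendation, here is an illustrative sketch of machine-readable metadata that captures the context experts usually carry in their heads. The field names are hypothetical, not a published schema; the point is that coverage gaps, definitions, and limitations live next to the data itself.

```python
# Illustrative variable-level metadata carrying the caveats models need:
# definition, coverage, and known limitations. Field names are hypothetical.
import json

variable_metadata = {
    "dataset": "Post-Secondary Employment Outcomes (PSEO), US Census Bureau",
    "variable": "earnings_1yr",
    "definition": "Earnings of graduates one year after completing a program",
    "coverage_note": "Not all states are reported; coverage varies by institution and cohort",
    "known_limitations": [
        "Earnings reflect only graduates matched to participating states' wage records",
        "Cross-state comparisons may be inappropriate where coverage differs",
    ],
}

print(json.dumps(variable_metadata, indent=2))  # publish alongside the data
```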

Models and workflows are evolving quickly, and newer versions may perform better on many of the tasks we tested. This makes ongoing, domain-grounded evaluation by subject-matter experts even more essential. Without it, there is no way to know whether improvements in general AI capabilities translate to reliable use of specific public data sources.
