Can synthetic data truly reflect the lived realities of people, especially in underrepresented contexts?

As artificial intelligence continues to reshape the way researchers gather and analyze data, synthetic datasets generated by large language models (LLMs) like ChatGPT and Llama are emerging as promising tools. They’re fast, inexpensive, and easy to deploy. But how accurate are they?

GeoPoll’s latest whitepaper explores this question through a groundbreaking study. We compared AI-generated synthetic data with real-world responses from a Computer-Assisted Telephone Interviewing (CATI) survey in Kenya to assess how well LLMs replicate demographic patterns, media usage, food consumption habits, and more in a low-resource setting.


Inside the Whitepaper

Some of the areas we covered in this 20+ page report include:

  • How closely synthetic datasets reflect demographic characteristics

  • Where and why LLMs diverge from real-world data (e.g., overestimating internet access via laptops)

  • The influence of prompting language (English vs. Swahili) on data quality

  • Whether known correlations—such as between education and food insecurity—are preserved in synthetic outputs

  • The implications for researchers, policymakers, and data scientists seeking cost-effective data solutions

  • How different LLMs compare

The findings are nuanced: while LLMs sometimes capture surface-level demographic traits, they often fail to reproduce behavioral patterns, cultural context, and complex social dynamics. The conclusion? Synthetic data can be useful, but it’s not a replacement for grounded, contextual data collection—particularly in underrepresented regions.


Why This Research Matters

More researchers and institutions are considering synthetic data as a low-cost alternative to survey research. As they do, the risk of drawing flawed conclusions grows, especially in places where context is everything. This paper offers a clear-eyed analysis of:

  • Systemic underrepresentation in training data

  • The limitations of prompt engineering

  • Safeguards and validation protocols for responsible synthetic data use

  • Adaptations to improve AI performance in local contexts (including GeoPoll’s role)

For researchers, these findings offer essential guidance for using synthetic data responsibly.

Download the Whitepaper

Read the whitepaper (PDF)

For more information, please contact us here.