Whitepaper: What do Bots Eat for Breakfast?

Can synthetic data truly reflect the lived realities of people, especially in underrepresented contexts?

As artificial intelligence continues to reshape the way researchers gather and analyze data, synthetic datasets generated by large language models (LLMs) like ChatGPT and Llama are emerging as promising tools. They’re fast, inexpensive, and easy to deploy. But how accurate are they?

GeoPoll’s latest whitepaper explores this question through a groundbreaking study. We compared AI-generated synthetic data with real-world responses from a Computer-Assisted Telephone Interviewing (CATI) survey in Kenya to assess how well LLMs replicate demographic patterns, media usage, food consumption habits, and more in a low-resource setting.

Inside the Whitepaper

Some of the areas we covered in this 20+ page report include:

How closely synthetic datasets reflect demographic characteristics
Where and why LLMs diverge from real-world data (e.g., overestimating internet access via laptops)
The influence of prompting language (English vs. Swahili) on data quality
Whether known correlations—such as between education and food insecurity—are preserved in synthetic outputs
The implications for researchers, policymakers, and data scientists seeking cost-effective data solutions
How different LLMs compare

The findings are nuanced: while LLMs sometimes capture surface-level demographic traits, they often fail to reproduce behavioral patterns, cultural context, and complex social dynamics. The conclusion? Synthetic data can be useful, but it’s not a replacement for grounded, contextual data collection—particularly in underrepresented regions.

Why This Research Matters

As more researchers and institutions consider synthetic data a low-cost alternative to survey research, the risk of drawing flawed conclusions increases, especially in places where context is everything. This paper offers a clear-eyed analysis of:

Systemic underrepresentation in training data
The limitations of prompt engineering
Safeguards and validation protocols for responsible synthetic data use
Adaptations to improve AI performance in local contexts (including GeoPoll’s role)

For researchers, these findings offer essential guidance for using synthetic data responsibly.

Download the Whitepaper

For more information, please contact us here.

Frankline Kibuacha | Dec. 07, 2024 | 1 min. read
