Beyond Buzzwords: Integrating LLMs Into UX Research
From planning to activation, a real-world breakdown of how LLMs can actually support UX research.
This case study translates the key insights from my peer-reviewed publication
"Beyond Buzzwords: The Development of Large Language Models and Their Use in Advertising and Strategic Communication Research"
into actionable strategies for UX research. In it, we proposed a conceptual framework for understanding how LLMs are being used in advertising research, and by extension, how these tools can be meaningfully applied within product and UX research workflows.
Drawing on our PRISMA-guided literature review of 68 empirical studies, we identified three major LLM use cases relevant to UXR:
| Use Case | What It Looks Like in UXR | Opportunity |
|---|---|---|
| LLM Output Testing | Researchers evaluate the quality, tone, or persuasiveness of LLM-generated content | Test how LLMs could augment user-facing microcopy, onboarding flows, or assistive agents |
| LLM as a Tool | LLMs are used to support research operations, e.g., transcript summarization, diary study synthesis, or survey generation | Reduce researcher load during data collection and early synthesis while speeding up iteration cycles |
| LLM-to-Human Comparison | LLMs stand in for participants ("silicon samples") to pretest studies or simulate edge cases | Explore how synthetic users might predict or pressure-test user journeys before rollout |
Each category reflects not only a use of LLMs, but a different mental model of what role LLMs should play in research: as a generator, a collaborator, or a proxy.
From Literature to Practice: What LLMs Actually Do for UX Research
While LLM adoption is accelerating across industry, the literature reveals a gap between how these models are used and how well they are understood. In our review, most empirical studies demonstrated feasibility, showing that LLMs can generate content, summarize text, or simulate users, but fewer addressed when these uses meaningfully improve research outcomes versus introducing hidden risk.
Below, I translate each of the three dominant LLM use cases identified in the literature into concrete UX research applications, highlighting both their value and their limits.
1. LLM Output Testing: Evaluating AI-Generated Experiences
In advertising and communication research, LLM output testing is the most common application. Studies in this category ask a simple question: how good is the content produced by an LLM? Researchers evaluate AI-generated ad copy, headlines, health information, or persuasive messages using human participants, expert judges, or content analysis.
Findings across this literature are consistent: LLM-generated content often performs surprisingly well on surface-level measures like fluency, clarity, and perceived quality. In some cases, participants cannot reliably distinguish AI-generated content from human-authored material. However, deeper issues, such as bias, hallucination, or subtle misinformation, frequently go unnoticed without expert review.
Translation to UXR: For product teams, this maps directly onto evaluating AI-powered features such as onboarding copy, help center responses, chatbots, search summaries, or recommendation explanations.
Rather than asking "Is this AI good?", UXR reframes the question as:
- Do users trust AI-generated explanations?
- Where does AI output feel helpful versus uncanny or overconfident?
- Which errors are visible to users, and which quietly degrade decision-making?
In practice, LLM output testing becomes a form of experience validation, where the researcher's role is not to optimize language quality alone, but to surface downstream effects on trust, comprehension, and behavior.
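One lightweight way to run this kind of experience validation is a blind comparison: present AI-generated and human-authored copy in random order, without source labels, and compare ratings afterward. The sketch below is a minimal, hypothetical harness; `rate_fn` stands in for whatever rating instrument a team actually uses (a participant panel, expert judges, or a survey), and the example strings and placeholder scoring are invented for illustration.

```python
import random
import statistics

def blind_output_test(human_copies, ai_copies, rate_fn):
    """Shuffle copies from both sources together so the rater is
    blind to authorship, then return the mean rating per source."""
    items = [(text, "human") for text in human_copies] + \
            [(text, "ai") for text in ai_copies]
    random.shuffle(items)  # blind the rater to the source
    ratings = {"human": [], "ai": []}
    for text, source in items:
        ratings[source].append(rate_fn(text))
    return {src: statistics.mean(vals) for src, vals in ratings.items()}

# Placeholder rater standing in for real participant trust ratings (1-5).
scores = blind_output_test(
    ["Welcome! Here's how to get started."],
    ["Get started in three easy steps."],
    rate_fn=lambda text: len(text) % 5 + 1,
)
```

The design choice that matters here is the shuffle: if raters can infer which copy is AI-generated, expectation effects tend to dominate the trust measure the study is trying to isolate.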
2. LLMs as Research Tools: Accelerating (and Reshaping) Research Workflows
The second major category identified in the literature positions LLMs not as objects of study, but as instruments that assist the research process itself. These studies use LLMs to summarize transcripts, classify sentiment, generate survey items, assist with literature reviews, or synthesize qualitative data.
Across both quantitative and qualitative research, LLM tools consistently improved speed and scale. Researchers reported faster synthesis cycles, lower costs, and increased feasibility for large datasets that would otherwise be prohibitive to analyze manually.
However, the literature also surfaces important caveats. LLMs tend to:
- Overrepresent dominant themes while underweighting minority or edge-case perspectives
- Produce confident summaries without transparent attribution
- Mask uncertainty by smoothing over contradictions in participant data
Translation to UXR: In real product teams, this category aligns with day-to-day research operations:
- Summarizing dozens of interview transcripts after a sprint
- Clustering open-ended survey responses
- Drafting research readouts for stakeholders
When applied thoughtfully, LLMs function best as first-pass synthesizers rather than final arbiters of insight. They are effective at pattern surfacing, but not pattern interpretation.
This shifts the researcher's role from manual coding toward sensemaking, validation, and triangulation: deciding which patterns matter, which are artifacts of the model, and which warrant deeper investigation.
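A first-pass synthesis pipeline can make that division of labor explicit: the model proposes clusters, and anything ambiguous is routed to a human-review queue rather than silently forced into a theme. The sketch below is a deliberately simplified stand-in; the keyword matcher plays the role an LLM classifier would in a real pipeline, and the theme names and responses are invented for illustration.

```python
from collections import defaultdict

# Hypothetical themes; in a real pipeline an LLM would propose
# labels and a researcher would validate them before clustering.
THEMES = {
    "navigation": {"menu", "find", "search", "lost"},
    "trust": {"privacy", "data", "secure", "scary"},
}

def first_pass_cluster(responses):
    """Assign each response to exactly one theme; ambiguous or
    unmatched responses go to a human-review queue instead."""
    clusters = defaultdict(list)
    review_queue = []
    for resp in responses:
        words = set(resp.lower().split())
        matches = [t for t, kws in THEMES.items() if words & kws]
        if len(matches) == 1:
            clusters[matches[0]].append(resp)
        else:  # zero or multiple matches: a researcher decides
            review_queue.append(resp)
    return dict(clusters), review_queue

clusters, queue = first_pass_cluster([
    "I could not find the menu",
    "Sharing my data feels scary",
    "The onboarding was fine",
])
```

The review queue is the point: it keeps the tool in the "first-pass synthesizer" role by surfacing exactly the responses where the model's confidence would otherwise mask uncertainty.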
3. LLM-to-Human Comparison: Simulated Users and "Silicon Samples"
The most conceptually provocative category in the literature treats LLMs as stand-ins for human participants. These studies compare LLM-generated responses to human data across classic experiments, persuasive tasks, or consumer decision scenarios.
Results are mixed but revealing. LLMs often approximate average human responses remarkably well, particularly for well-studied populations and mainstream viewpoints. However, they struggle with:
- Novel or rapidly changing contexts
- Marginalized or underrepresented perspectives
- Embodied, emotional, or situational constraints
Translation to UXR: While LLMs should not replace human participants, they offer compelling value earlier in the research lifecycle.
In practice, this looks like:
- Pressure-testing user flows before recruiting participants
- Simulating edge cases to identify blind spots in journey maps
- Pre-validating survey logic or experimental manipulations
Rather than acting as "synthetic users," LLMs function best as hypothesis stress-testers, revealing where assumptions break down before real users ever see the product.
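Such a pretest can be as simple as running a drafted question past several simulated personas and scanning the answers for confusing wording or missing response options before recruiting. The sketch below assumes a stub `ask_llm` function in place of a real chat-completion API call; the persona list and question are invented examples.

```python
# Hypothetical personas chosen to stress different assumptions
# in the draft question; real studies would tailor these.
PERSONAS = [
    "a first-time user on a slow phone",
    "an expert user migrating from a competitor",
    "a screen-reader user",
]

def ask_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return f"[simulated answer to: {prompt[:40]}...]"

def pretest_question(question: str) -> dict:
    """Collect one simulated answer per persona so a researcher
    can scan for confusing wording before fielding the study."""
    return {
        persona: ask_llm(f"You are {persona}. Answer honestly: {question}")
        for persona in PERSONAS
    }

answers = pretest_question("How easy was it to set up your account?")
```

Used this way, the simulated answers are never treated as data about users; they are a cheap pass at finding where the instrument itself breaks.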
Importantly, this use case raises ethical and epistemological questions that mirror those surfaced in advertising research: what does it mean to generalize from a model trained on historical data, and whose experiences are implicitly encoded (or excluded) in that training?
For UX researchers, this reinforces a core principle: LLMs can inform research design, but they cannot replace the lived complexity of human experience.
---
Much of the hype around LLMs in UX research collapses wildly different practices into a single narrative: AI will automate research. The literature tells a more nuanced story. What emerges instead is a set of distinct roles that LLMs play at different moments in the research process: sometimes accelerating work, sometimes reshaping it, and sometimes introducing new risks.
To move beyond abstract claims, the following framework situates LLM use within the UX research lifecycle, showing how these tools are currently applied from planning through activation, and where their strengths and limits are most pronounced.
Design Implications for UXR Practice
- UX research teams need clearer frameworks for where LLMs add rigor vs. risk
- Token limits, hallucination risk, and training bias must be actively managed, not assumed
- LLMs can assist with activation (e.g., generating highlight reels, gamified debriefs) just as much as collection
- Participant replacement is not the future, but pretesting with LLMs could save teams time and budget
Why This Matters
The future of UX research will be shaped not just by what we study, but how we study it. As AI-native companies lean into LLMs, researchers must develop internal literacy around LLM capabilities, constraints, and implementation. This case study draws from academic research to inform applied methods that are scalable, ethically grounded, and deeply aware of the cognitive gaps these tools still present.