publications and projects
2026
- [In Submission] Auditing LLM Responses in a Complex Policy Landscape: Abortion Law in the United States. Ro Encarnación, Christen Hammock Jones, and Danaé Metaxa. 2026.
Inaccurate LLM responses to questions tied to law and policy can dissuade people from exercising their rights. U.S. abortion law is a salient case study in which answers to high-stakes questions depend on a changing and fragmented policy landscape. We audited OpenAI ChatGPT-4o, Google Gemini 2.0 Flash, and Perplexity Sonar using a range of realistic abortion-related questions, systematically varying wording and U.S. state. We evaluated the resulting 54,704 model responses, comparing them against ground truth data compiled from state abortion statutes. Across models, average accuracy was 78%. Using pregnancy rate data, we estimated that inaccurate model responses could potentially expose over 1 million pregnant people in the U.S. to misleading abortion information. We also observed that models sycophantically affirmed question framing over legal correctness, and overall accuracy varied by question phrasing and state, with greater inaccuracy in more abortion-restrictive states. Based on these findings, we argue for model transparency, longitudinal evaluation of LLMs, and multi-stakeholder collaboration to prevent failures when LLMs mediate access to critical legal information.
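The audit's prompt-construction step (systematically crossing question wording with U.S. state) can be sketched as a simple prompt grid. Everything below is a hypothetical illustration, not the study's actual instrument: the templates, example states, and gestational weeks are placeholders, and a full audit would enumerate all 50 states and many more phrasings.

```python
from itertools import product

# Hypothetical question templates; {state} and {weeks} are filled per variant.
templates = [
    "Is abortion legal in {state}?",
    "Can I get an abortion in {state} at {weeks} weeks?",
]
states = ["Alabama", "Alaska", "California"]  # all 50 states in a full audit
weeks = [6, 15, 24]

# Cross every template with every state/week combination; the set
# deduplicates templates that do not use the {weeks} slot.
prompts = sorted(
    {t.format(state=s, weeks=w) for t, s, w in product(templates, states, weeks)}
)
```

Each prompt in the grid would then be sent to each model, and responses scored against the per-state ground truth.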
- [In Submission] Everyday Auditing of TikTok’s Generative AI Manga Filter. Ro Encarnación, Luis Morales-Navarro, Hita Kambhamettu, and 1 more author. 2026.
Recent work in AI auditing has begun to explore the potential for non-expert users to be engaged in auditing. Crowdsourced audits, everyday audits, and end-user audits are all types of user-engaged AI auditing in which one or more users (with varying degrees of involvement and intention) investigate instances of harmful algorithm behavior. We extend this literature through a case study of the AI Manga filter on TikTok, a generative AI-based filter on the platform that turns users’ photos into computer-generated manga-style illustrations. We show that TikTok users’ everyday auditing behaviors in our descriptive generative AI case study allow them to uncover algorithmic biases, including race and gender stereotypes, as well as problematic behaviors including anthropomorphization of inanimate objects and hypersexualization of human figures. We also analyze their methods for doing so; while most prior audits explicitly focus on identifying harms, here we found that users were largely motivated by fun and curiosity. We discuss the implications of these everyday generative AI audits for the field, including the need to consider alternative (and possibly unconventional) frameworks and methods for auditing non-deterministic technologies, and to move beyond normative ideas of harm. We argue for collaboratively co-engaging community users, auditors, and researchers in their conceptions of harm to guide future evaluations of generative AI systems.
2025
- Can an LLM Tell Me If I Can Legally Get an Abortion? Ro Encarnación and Danaé Metaxa. HEAL @ CHI 2025 Workshop – Human-centered Evaluation and Auditing of Language Models, 2025.
- 🏆 Best Paper Honorable Mention: Auditing the Audits: Lessons for Algorithmic Accountability from Local Law 144’s Bias Audits. Marissa Kumar Gerchick, Ro Encarnación, Cole Tanigawa-Lau, and 3 more authors. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 2025.
In this work, we “audit the audits,” analyzing the documents produced pursuant to one of the United States’ first enacted laws regulating the use of artificial intelligence in employment: New York City’s Local Law 144. This law requires employers and employment agencies using certain types of automated tools to publish “bias audits” with statistics about how different sex and racial groups fare in the hiring process when the tools are used. We collect and conduct a comprehensive analysis of all Local Law 144 bias audits (N=116) made publicly available to our knowledge from the law taking effect in July 2023 until early November 2024, and describe the extensive challenges we faced in identifying, archiving, extracting information from, and ultimately analyzing these bias audits. We identify several ways that bias audits produced in accordance with Local Law 144 are incomplete evaluations of algorithmic bias, despite news coverage and characterizations by employers and vendors suggesting otherwise. We show that Local Law 144 bias audits are significantly hampered by several issues, including missing demographic data, opaque data aggregation, problematic uses of “test data,” and reliance on metrics that do not represent how automated hiring tools are used in practice. We analyze the reported results in Local Law 144 bias audits alongside the four-fifths rule often used as a measure for assessing adverse impact in employment contexts. Most audits do not report results that suggest violations of the four-fifths rule. Crucially, however, we show that these tools could often be in violation of the four-fifths rule when considering potential impacts of missing demographic data. We offer ten practical recommendations to strengthen future legislative efforts that mandate algorithm auditing in hiring and other areas, and contribute an open dataset and codebase for extracting and combining bias audit results to support future auditing efforts.
2023
- Representation, Self-Determination, and Refusal: Queer People’s Experiences with Targeted Advertising. Princess Sampson, Ro Encarnación, and Danaé Metaxa. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, Chicago, IL, USA, 2023.
Targeted online advertising systems increasingly draw scrutiny for the surveillance underpinning their collection of people’s private data, and subsequent automated categorization and inference. The experiences of LGBTQ+ people, whose identities call into question dominant assumptions about who is seen as “normal,” and deserving of privacy, autonomy, and the right to self-determination, are a fruitful site for exploring the impacts of ad targeting. We conducted semi-structured interviews with LGBTQ+ individuals (N=18) to understand their experiences with online advertising, their perceptions of ad targeting, and the interplay of these systems with their queerness and other identities. Our results reflect participants’ overall negative experiences with online ad content—they described it as stereotypical and tokenizing in its lack of diversity and nuance. But their desires for better ad content also clashed with their more fundamental distrust and rejection of the non-consensual and extractive nature of ad targeting. They voiced privacy concerns about continuous data aggregation and behavior tracking, a desire for greater control over their data and attention, and even the right to opt-out entirely. Drawing on scholarship from queer and feminist theory, we explore targeted ads’ homonormativity in their failure to represent multiply-marginalized queer people, the harms of automated inference and categorization to identity formation and self-determination, and the theory of refusal underlying participants’ queer visions for a better online experience.
2022
- Adaptive Sampling Strategies to Construct Equitable Training Datasets. William Cai, Ro Encarnación, Bobbie Chern, and 4 more authors. 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 2022.
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data—an application domain that often suffers from non-representative data collection. When optimizing policies for overall or group-specific average health, we find that our adaptive approach outperforms heuristic strategies, including equal and representative sampling. In this sense, equal treatment with respect to sampling decisions does not guarantee equal or equitable outcomes.
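The sequential allocation setting above can be sketched as a greedy procedure that spends a fixed budget batch by batch, each time giving samples to the group whose additional data most raises the worst group’s estimated performance. This is an illustrative max-min sketch only: the `allocate_budget` helper, the power-law learning curves, and the budget figures are assumptions for demonstration, not the paper’s actual framework (which supports general objectives, per-sample costs, and unknown learning rates).

```python
def allocate_budget(budget, step, learning_curves):
    """Greedily allocate a sampling budget across groups to maximize
    the minimum group's estimated performance.

    learning_curves: dict mapping group -> callable n -> est. performance
    after collecting n samples from that group."""
    counts = {g: 0 for g in learning_curves}
    spent = 0
    while spent + step <= budget:
        # Choose the group whose next batch yields the best worst-case.
        def min_perf_after_giving(g):
            return min(
                learning_curves[h](counts[h] + (step if h == g else 0))
                for h in learning_curves
            )
        best = max(learning_curves, key=min_perf_after_giving)
        counts[best] += step
        spent += step
    return counts

# Hypothetical power-law learning curves: perf(n) = a - b * (n + 1) ** -0.5.
curves = {
    "majority": lambda n: 0.95 - 0.5 * (n + 1) ** -0.5,
    "minority": lambda n: 0.90 - 0.9 * (n + 1) ** -0.5,
}
alloc = allocate_budget(budget=1000, step=100, learning_curves=curves)
```

Under these assumed curves the slower-learning, lower-performing group receives most of the budget, illustrating the paper’s point that equal or population-representative sampling need not yield equitable outcomes.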