2025 LLM Bias Research Study Findings


Summary

These 2025 LLM bias research study findings explore how large language models (LLMs), which power AI tools like chatbots and search engines, can show biases in real-world applications—such as healthcare, political discussions, and social services—based on factors like gender, race, income, and the data used to train or evaluate them. These studies help us understand and address how AI models may unintentionally favor some groups or perspectives over others in ways that affect fairness and outcomes.

  • Prioritize diverse testing: Make sure to assess AI models using real-world data that represents different social groups and contexts to uncover potential biases early.
  • Mix judging approaches: Use a combination of human evaluations and independent AI judges to reduce the risk of models favoring their own outputs or replicating existing prejudices; a minimal sketch of this mixed setup follows the summary below.
  • Build with inclusivity: Design and deploy AI systems with the needs of marginalized and low-resource communities in mind, so they don’t reinforce or worsen existing inequalities.
Summarized by AI based on LinkedIn member posts
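
To make the second recommendation concrete, here is a minimal Python sketch of mixing independent AI judges with human evaluation. The function names and toy verdicts are hypothetical, not outputs of any specific model or study.

```python
# Minimal sketch: combine verdicts from several independent LLM judges with
# human labels and flag items where they disagree. All verdicts below are
# hypothetical placeholders.
from collections import Counter

def majority_vote(votes):
    """Return the most common verdict (e.g. 'A' or 'B') in a list of votes."""
    return Counter(votes).most_common(1)[0][0]

def mixed_evaluation(judge_votes, human_votes):
    """judge_votes / human_votes: dicts mapping item_id -> list of verdicts."""
    results = {}
    for item_id, votes in judge_votes.items():
        judge_consensus = majority_vote(votes)
        human_consensus = majority_vote(human_votes[item_id])
        results[item_id] = {
            "judges": judge_consensus,
            "humans": human_consensus,
            "needs_review": judge_consensus != human_consensus,
        }
    return results

# Toy usage: two comparison items, three judge models, two human annotators.
judges = {"q1": ["A", "A", "B"], "q2": ["B", "B", "B"]}
humans = {"q1": ["A", "A"], "q2": ["A", "A"]}
print(mixed_evaluation(judges, humans))
# q2 gets flagged: the judges prefer B while the humans prefer A.
```

Flagging items where the judge consensus diverges from the human consensus is one simple way to decide which outputs deserve a closer human look.
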
  • View profile for Robbie Freeman

    Chief Digital Transformation Officer @ Mount Sinai | Digital | AI | Innovation

    11,714 followers

    Insights from our recent publication in Nature Medicine: Sociodemographic Biases in Medical Decision Making by Large Language Models.

    As AI continues to shape clinical decision-making, our study reveals an urgent challenge: LLMs often recommend different care—sometimes more invasive, sometimes less—for patients based on race, gender identity, income, or housing status, even when clinical details are identical. We analyzed 1.7 million AI-generated outputs from 9 leading models.

    Key findings:
    1. LGBTQIA+ patients were 6–7x more likely to be flagged for mental health assessments than clinically warranted
    2. High-income patients were steered toward advanced diagnostics; low-income patients often weren’t
    3. Black, unhoused, and transgender patients were disproportionately triaged as urgent—even without clinical justification

    Read the study here 👉 https://lnkd.in/ejnGpGCt

    As we focus on embedding AI responsibly in healthcare, this work underscores a fundamental truth: we can't separate innovation from equity. It must be baked in from the start. Proud to work alongside brilliant colleagues advancing this vital research: Girish Nadkarni, Alexander Charney, Eyal Klang, Ben Glicksberg, Mahmud Omar, Shelly Soffer, MD, Benjamin Kummer, MD, Carol Horowitz, MD, MPH, Donald Apakama, Reem Agbareia, Nicola Luigi Bragazzi #Equity #HealthcareAI
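
The core audit idea described in the post, holding the clinical vignette constant and varying only the sociodemographic descriptor, can be illustrated with a short sketch. This is not the study's actual pipeline; the case text, descriptor labels, and the model_fn stand-in are hypothetical.

```python
# Illustrative counterfactual audit (not the study's code): keep the clinical
# details identical, vary only the sociodemographic descriptor, and compare
# the recommendations the model returns for each variant.

CASE = "A 58-year-old patient presents with chest pain radiating to the left arm."
DESCRIPTORS = ["high-income", "low-income", "unhoused", "transgender"]  # hypothetical labels

def build_prompt(descriptor: str) -> str:
    return (
        f"The patient is described as {descriptor}. {CASE} "
        "What triage level and next diagnostic step do you recommend?"
    )

def audit(model_fn, descriptors=DESCRIPTORS):
    """model_fn is a stand-in for whatever LLM call you use (hypothetical here)."""
    return {d: model_fn(build_prompt(d)) for d in descriptors}

if __name__ == "__main__":
    # Dummy model so the sketch runs as-is; with a real model, any divergence
    # across these otherwise identical cases is the kind of signal the study quantified.
    dummy = lambda prompt: "ECG and troponin; urgent triage"
    for descriptor, recommendation in audit(dummy).items():
        print(f"{descriptor:>12}: {recommendation}")
```
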

  • View profile for Andrew Hall

    Prof @ Stanford GSB, Hoover | Studying technology, politics, and decentralized governance

    2,485 followers

    GPT5 remains left-wing slanted according to Americans—substantially more so than Anthropic’s Claude or Google’s Gemini, according to our newly updated research.

    Yesterday, OpenAI released fascinating new research suggesting that ChatGPT no longer provides left-slanted responses to politically charged questions, on net. It’s fantastic to see a major AI company taking this issue seriously and providing original, transparent evidence on it. We’re especially thrilled to see them evaluate the same issues we studied in our paper, and to do so using a technique similar to the one we developed.

    Why do we come to different conclusions than they do? We’re not sure yet, but we think there is an important methodological advantage to our study. Specifically, we study actual evaluations by Americans, while OpenAI uses LLM evaluations of their LLM output. When we talk about political bias, what we really care about is how users perceive AI output, not what an LLM thinks about it. LLMs may be able to simulate users’ perceptions in some cases, but in our study we found that LLM evaluations aren’t a good proxy for how Americans feel about political output from AI models. So it’s worth it to survey actual Americans about this.

    While OpenAI’s new research does suggest some very interesting dimensions in which ChatGPT may not demonstrate clear ideological slant, when actual Americans use ChatGPT, they still seem to see slant overall. The amount of slant we find for GPT5 is very similar to what we found for GPT4 in the past. Americans perceive it as substantially more slanted than Claude and Gemini, a little more slanted than Meta’s Llama, and less slanted than Grok (which is surprisingly the most left-wing of all the models we study!).

    We’re excited to continue working on this and thrilled to see companies joining this research agenda. We’ll follow up with a more detailed update on our new results soon.
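
The methodological point here, that human perceptions and LLM evaluations of slant can diverge, can be sketched in a few lines. The rating scale and all numbers below are made up for illustration and are not the study's data.

```python
# Toy comparison of perceived slant: human survey ratings vs. an LLM judge's
# ratings of the same model outputs, on a made-up -2 (left) to +2 (right) scale.
from statistics import mean

human_ratings = {          # model -> slant ratings from surveyed people (hypothetical)
    "gpt-5": [-1, -1, 0, -2, -1],
    "claude": [0, -1, 0, 0, 0],
    "gemini": [0, 0, -1, 0, 0],
}
llm_judge_ratings = {      # same outputs, rated by an LLM judge instead (hypothetical)
    "gpt-5": [0, 0, 0, -1, 0],
    "claude": [0, 0, 0, 0, 0],
    "gemini": [0, 0, 0, 0, -1],
}

def mean_slant(ratings):
    return {model: mean(values) for model, values in ratings.items()}

human = mean_slant(human_ratings)
judge = mean_slant(llm_judge_ratings)
for model in human:
    gap = human[model] - judge[model]
    print(f"{model:>7}: human={human[model]:+.2f}  judge={judge[model]:+.2f}  gap={gap:+.2f}")
# A consistently negative gap would mean humans see more left slant than the LLM judge reports.
```
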

  • View profile for Aditya Vashistha

    Assistant Professor at Cornell University

    3,259 followers

    New Paper at FAccT 2025: "Beyond Semantics: Examining Gender Bias in LLMs Deployed within Low-resource Contexts in India". In this paper, co-authored with Urvashi Aneja and Aarushi Gupta, we critically examine how gender bias manifests in LLMs when deployed in critical social sectors like agriculture and healthcare in India.

    🔍 What we found: LLMs trained and deployed without gender-aware design often replicate and reinforce patriarchal norms. We identify four dimensions of bias in LLMs:
    📌 Content-based bias
    📌 Relevance-based bias
    📌 Risk-based bias
    📌 Accessibility-based bias

    Most LLMs in our study were built without scoping for women’s needs, evaluating for bias, or designing with low-resource users in mind. Developers prioritized minimum viable products and accuracy over equity—leading to tools that systematically exclude and disadvantage women.

    Fixing gender bias in LLMs isn’t just a technical challenge. It demands structural change, long-term investment, and feminist design practices grounded in local realities.

    #AIForSocialGood #GenderBias #ResponsibleAI #LLMs #GlobalSouth #AIandSociety

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    13,328 followers

    Fascinating New Research on LLM Biases in Information Retrieval Evaluation

    I just read a groundbreaking paper from Google DeepMind researchers that explores a critical issue in modern information retrieval: how the widespread adoption of large language models (LLMs) across different components of search systems might introduce significant biases. The paper "Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation" provides the first empirical evidence that LLM-based judges exhibit substantial bias toward LLM-based rankers - a phenomenon previously only hypothesized but never proven.

    The researchers conducted experiments using TREC Deep Learning 2019 and 2020 datasets, employing various LLM-based rankers (RankT5, Relevance Generation, Pairwise Relevance Prompting) and comparing them against "oracle" rankers that leverage human relevance judgments. They used different sizes of Gemini models (v1 Nano, v1 Pro, v1.5 Flash, v1.5 Pro) as judges.

    Their findings reveal three key insights:
    1. LLM judges are more lenient in relevance assessments than human judges
    2. LLM judges show significant bias toward LLM-based rankers, completely inverting the ranking order compared to human judgments
    3. LLM judges struggle to discern subtle but statistically significant performance differences between systems

    Interestingly, contrary to previous research, their preliminary study found no evidence of bias against AI-generated content when using LLM judges.

    This research highlights the need for caution when using LLMs for evaluation in information retrieval. The authors provide guidelines for employing LLMs as judges and outline a research agenda to address these challenges. As we increasingly rely on LLMs across the information ecosystem - from content creation to retrieval, ranking, and evaluation - understanding these potential biases becomes crucial for building fair and effective search systems.
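
As a rough illustration of the evaluation setup described above, the sketch below scores the same two ranked runs once with human-style relevance labels and once with more lenient LLM-judge-style labels, and shows how the system ordering can flip. The documents, labels, and runs are invented; the paper itself uses TREC DL qrels and Gemini judges.

```python
# Toy DCG@k evaluation under two judgment sets: "human" labels and more
# lenient "LLM judge" labels. All documents, labels, and runs are made up.
import math

def dcg_at_k(run, qrels, k=3):
    """Discounted cumulative gain over the top-k documents of a run."""
    return sum(qrels.get(doc, 0) / math.log2(rank + 2) for rank, doc in enumerate(run[:k]))

human_qrels = {"d1": 3, "d2": 1, "d3": 0, "d4": 2}
llm_qrels   = {"d1": 2, "d2": 2, "d3": 3, "d4": 3}   # more lenient, and favors different docs

runs = {
    "llm_ranker":    ["d3", "d4", "d1", "d2"],
    "oracle_ranker": ["d1", "d4", "d2", "d3"],
}

for judge_name, qrels in [("human_judge", human_qrels), ("llm_judge", llm_qrels)]:
    scores = {name: round(dcg_at_k(run, qrels), 2) for name, run in runs.items()}
    ordering = sorted(scores, key=scores.get, reverse=True)
    print(f"{judge_name}: {scores} -> ranking: {ordering}")
# With these toy labels the oracle ranker wins under human judgments, while the
# more lenient LLM-judge labels flip the ordering in favor of the LLM ranker.
```
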

  • View profile for Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    162,838 followers

    How biased are LLMs when you use them for synthetic data generation and as LLM-as-a-Judge to evaluate? Answer: Significantly biased. 👀

    The “Preference Leakage: A Contamination Problem in LLM-as-a-judge” paper shows that a judge built on the same LLM, the same model family, or even a previous version of the data-generating model can have a preference toward its “own” data.

    Experiments:
    1️⃣ Use an LLM (e.g., GPT-4, Gemini) to generate synthetic responses to a set of prompts (e.g., UltraFeedback).
    2️⃣ Fine-tune different versions of "student" models (e.g., Mistral, Qwen) on the synthetic data.
    3️⃣ Evaluation: Use multiple "judge" LLMs to perform pairwise comparisons of these student models on benchmarks (e.g., Arena-Hard, AlpacaEval 2.0).
    4️⃣ Bias: Calculate and analyze the Preference Leakage Score (PLS) across different scenarios (same model, inheritance, same family).

    PLS measures how much more often a judge LLM prefers a student model trained on its own data compared to student models trained on data from other judges. If both teachers give similar grades to both students = low PLS (fair judging); if teachers give better grades to their own students = high PLS (biased judging).

    Insights:
    💡 LLMs show a bias towards student models trained on data generated by themselves.
    📈 Model size matters: Larger models (14B vs 7B) show stronger preference leakage.
    🧪 Supervised fine-tuning (SFT) leads to the highest PLS (23.6%), while DPO reduces it (5.2%).
    ❓ PLS is higher in subjective tasks, e.g. writing, compared to objective ones.
    🧑🧑🧒🧒 Relationship bias: Same model > inheritance > same family in terms of leakage severity.
    🌊 Data mixing helps but doesn't solve it: Even 10% synthetic data shows detectable leakage.
    ✅ Use multiple independent judges and mix with human evaluation.

    Paper: https://lnkd.in/eupf2Vyx
    Github: https://lnkd.in/eeDdrEXb
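
A simplified reading of the preference-leakage signal described above is the gap between how often a judge prefers the student trained on its own data and how often independent judges prefer that same student. The sketch below follows the post's informal description rather than the paper's exact PLS formula, and all verdicts are invented.

```python
# Simplified preference-leakage sketch (informal, not the paper's exact PLS):
# compare the win rate a judge gives "its own" student against the win rate
# an independent judge gives that same student on the same comparisons.

def win_rate(verdicts, student):
    """Fraction of pairwise comparisons won by the given student."""
    return sum(1 for winner in verdicts if winner == student) / len(verdicts)

# Hypothetical pairwise verdicts between student_A (trained on judge A's data)
# and student_B (trained on another model's data), over the same prompt set.
verdicts_from_judge_A     = ["student_A", "student_A", "student_A", "student_B",
                             "student_A", "student_A", "student_A", "student_B"]
verdicts_from_other_judge = ["student_A", "student_B", "student_B", "student_A",
                             "student_B", "student_A", "student_B", "student_B"]

own_rate     = win_rate(verdicts_from_judge_A, "student_A")      # judge scoring its own student
neutral_rate = win_rate(verdicts_from_other_judge, "student_A")  # independent judge, same student

print(f"own-judge win rate: {own_rate:.2f}, independent win rate: {neutral_rate:.2f}")
print(f"leakage signal: {own_rate - neutral_rate:+.2f}")  # near zero would suggest fair judging
```
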

  • View profile for Zhiyong Lu, PhD FACMI

    Senior Investigator, NIH; Deputy Director for Literature Search, NCBI; Adjunct Professor of CS, UIUC

    3,104 followers

    Pleased to share another work in the #Safety of #LLMs in Medicine series, a key topic in our current medical AI research: Unmasking and quantifying racial bias of large language models (LLMs) in medical report generation.

    Large language models like GPT-3.5-turbo and GPT-4 hold promise for healthcare professionals, but they may inadvertently inherit biases during their training, potentially affecting their utility in medical applications. Despite a few attempts in the past, the precise impact and extent of these biases remain uncertain. To close this gap, we aim to address this uncertainty through both qualitative and quantitative analyses.

    In this work, we found that current LLMs often projected different costs and hospital stays for white patients than for patients from other racial or ethnic groups. These biases mirror the disparities in real-world healthcare. Our findings show the need for more research to ensure that inappropriate biases are removed from LLMs, so that predictions of possible outcomes are fair and accurate for all patients.

    Great work led by CS student Yifan Yang and Dr. Qiao Jin, together with our collaborators Prof. Furong Huang and Xiaoyu Liu at the University of Maryland College Park.
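
One way to quantify the kind of gap described above is to compare the model's projected cost and length of stay across otherwise identical reports that differ only in the patient's stated race. The sketch below is illustrative only; the group labels and numbers are invented, not the study's data.

```python
# Illustrative gap calculation (not the study's code): mean projected cost and
# hospital stay per group, reported as differences from a reference group.
# All projections below are invented placeholder values.
from statistics import mean

projections = {  # race label -> list of (projected_cost_usd, projected_stay_days)
    "white":    [(12000, 3), (9500, 2), (11000, 3)],
    "black":    [(14500, 4), (11800, 4), (13200, 5)],
    "hispanic": [(13900, 4), (12500, 3), (13000, 4)],
}

reference = "white"
ref_cost = mean(cost for cost, _ in projections[reference])
ref_stay = mean(stay for _, stay in projections[reference])

for group, values in projections.items():
    cost_gap = mean(cost for cost, _ in values) - ref_cost
    stay_gap = mean(stay for _, stay in values) - ref_stay
    print(f"{group:>8}: cost gap = {cost_gap:+8.1f} USD, stay gap = {stay_gap:+.2f} days")
# Non-zero gaps for clinically identical reports are the bias signal the study measures.
```
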
