
AI medical tools found to downplay symptoms of women, ethnic minorities


Artificial intelligence tools used by doctors risk leading to worse health outcomes for women and ethnic minorities, as a growing body of research shows that many large language models downplay the symptoms of these patients.

A series of recent studies has found that the uptake of AI models across the healthcare sector could lead to biased medical decisions, reinforcing patterns of undertreatment that already exist across different groups in Western societies.

The findings by researchers at leading US and UK universities suggest that medical AI tools powered by LLMs tend to understate the severity of symptoms among female patients, while also displaying less “empathy” towards Black and Asian ones.

The warnings come as the world’s top AI groups such as Microsoft, Amazon, OpenAI and Google rush to develop products that aim to reduce physicians’ workloads and speed up treatment, all in an effort to help overstretched health systems around the world.

Many hospitals and doctors globally are using LLMs such as Gemini and ChatGPT as well as AI medical note-taking apps from start-ups including Nabla and Heidi to auto-generate transcripts of patient visits, highlight medically relevant details and create clinical summaries.

In June, Microsoft revealed it had built an AI-powered medical tool it claims is four times more successful than human doctors at diagnosing complex ailments.

But research by MIT’s Jameel Clinic in June found that AI models, such as OpenAI’s GPT-4, Meta’s Llama 3 and Palmyra-Med (a healthcare-focused LLM), recommended a much lower level of care for female patients, and suggested some patients self-treat at home instead of seeking help.

A separate study by the MIT team showed that OpenAI’s GPT-4 and other models also gave answers that showed less compassion towards Black and Asian people seeking support for mental health problems.

That suggests “some patients could receive much less supportive guidance based purely on their perceived race by the model,” said Marzyeh Ghassemi, associate professor at MIT’s Jameel Clinic.

Similarly, research by the London School of Economics found that Google’s Gemma model, which is used by more than half the local authorities in the UK to support social workers, downplayed women’s physical and mental issues in comparison with men’s when used to generate and summarize case notes.

Ghassemi’s MIT team found that patients whose messages contained typos, informal language or uncertain phrasing were 7 to 9 percent more likely to be advised against seeking medical care by AI models used in a medical setting than patients whose messages were perfectly formatted, even when the clinical content was the same.

This could result in people who don’t speak English as a first language, or who are not comfortable using technology, being treated unfairly.

The problem of harmful biases stems partly from the data used to train LLMs. General-purpose models, such as GPT-4, Llama and Gemini, are trained on data from the internet, and the biases in those sources are therefore reflected in their responses. AI developers can also influence how far those biases creep into systems by adding safeguards after the model has been trained.

“If you’re in any situation where there’s a chance that a Reddit subforum is advising your health decisions, I don’t think that that’s a safe place to be,” said Travis Zack, adjunct professor at the University of California, San Francisco, and chief medical officer of AI medical information start-up Open Evidence.

In a study last year, Zack and his team found that GPT-4 did not take into account the demographic diversity of medical conditions, and tended to stereotype certain races, ethnicities and genders.

Researchers warned that AI tools can reinforce patterns of undertreatment that already exist in the healthcare sector, as data in health research is often heavily skewed towards men, and women’s health issues, for example, suffer from chronic underfunding and a lack of research.

OpenAI said many studies evaluated an older model of GPT-4, and the company had improved accuracy since its launch. It had teams working on reducing harmful or misleading outputs, with a particular focus on health. The company said it also worked with external clinicians and researchers to evaluate its models, stress test their behavior and identify risks.

The group has also developed a benchmark together with physicians to assess LLM capabilities in health, which takes into account user queries of varying styles, levels of relevance and detail.

Google said it took model bias “extremely seriously” and was developing privacy techniques that can sanitise sensitive datasets, as well as safeguards against bias and discrimination.

Researchers have suggested that one way to reduce medical bias in AI is to identify what data sets should not be used for training in the first place, and then train on diverse and more representative health data sets.

Zack said Open Evidence, which is used by 400,000 doctors in the US to summarize patient histories and retrieve information, trained its models on medical journals, the US Food and Drug Administration’s labels, health guidelines and expert reviews. Every AI output is also backed up with a citation to a source.

Earlier this year, researchers at University College London and King’s College London partnered with the UK’s NHS to build a generative AI model, called Foresight.

The model was trained on anonymized patient data from 57 million people on medical events such as hospital admissions and Covid-19 vaccinations. Foresight was designed to predict probable health outcomes, such as hospitalization or heart attacks.

“Working with national-scale data allows us to represent the full kind of kaleidoscopic state of England in terms of demographics and diseases,” said Chris Tomlinson, honorary senior research fellow at UCL, who is the lead researcher of the Foresight team. Tomlinson said that, although not perfect, it offered a better start than more general datasets.

European scientists have also trained an AI model called Delphi-2M that predicts susceptibility to diseases decades into the future, based on anonymized medical records from 400,000 participants in UK Biobank.

But with real patient data of this scale, privacy often becomes an issue. The NHS Foresight project was paused in June to allow the UK’s Information Commissioner’s Office to consider a data protection complaint, filed by the British Medical Association and Royal College of General Practitioners, over its use of sensitive health data in the model’s training.

In addition, experts have warned that AI systems often “hallucinate”—or make up answers—which could be particularly harmful in a medical context.

But MIT’s Ghassemi said AI was bringing huge benefits to healthcare. “My hope is that we will start to refocus models in health on addressing crucial health gaps, not adding an extra percent to task performance that the doctors are honestly pretty good at anyway.”

© 2025 The Financial Times Ltd. All rights reserved Not to be redistributed, copied, or modified in any way.


Saturday Morning Breakfast Cereal - Sheep



Click here to go see the bonus panel!

Hovertext:
I believe we've been breeding each other for aww, but so far it hasn't been very successful.



Estonia says Russian jets violated its airspace for 12 minutes

Three Russian military jets violated the NATO member’s airspace in an “unprecedentedly brazen” incursion, Foreign Minister Margus Tsahkna said.

kazriko
2 hours ago
Yep, Russia just keeps poking and poking.
kazriko
2 hours ago
https://www.yahoo.com/news/articles/poland-scrambles-aircraft-russia-launches-034844938.html The latest, Poland is starting to take it seriously when Russia launches drones into Ukraine, just in case they drift over the border again.

: Anonymous 03/10/23(Fri)00:56:47 No.72515517 >be me >cybersecurity analyst >mak...

Anonymous 03/10/23(Fri)00:56:47 No.72515517
>be me
>cybersecurity analyst
>make six figure salary
>tell ladies I make a six figure salary
>they ask for my name and phone number
>I'm too smart to give out my personal information

kazriko
2 hours ago
🤣

subtle and unknowable but quite important


A short one for Friday, that’s been on my mind for a while. Let’s take a simple formula for estimating a risk exposure:

Probability of Loss x Severity of Loss = Expected Loss
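(To put made-up numbers on it, which are mine rather than the post's: a 2 per cent chance of a £1m loss comes out as an expected loss of £20,000.)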

Now let’s get a little bit more philosophical. Assume that you’ve made the best possible estimates you can, on the basis of all the data that you have, of the probability and severity of loss. But you’re not satisfied…

… because you remember that Donald Rumsfeld quote, and you think to yourself that there are “novel risks” out there. Things like climate risk, geopolitical risk, cyber risk and such like, which aren’t in your data set (or at least, aren’t in it with sufficient weight) because they haven’t happened before (or at least, haven’t happened as often as there’s reason to think they will happen in the future).

This kind of thing is the problem that’s been on my mind forever: the extent to which things don’t follow actuarially well-behaved processes, the future doesn’t always resemble the past, and the distribution of the outcomes of a random variable over time doesn’t have to be the same as the probability distribution at any one moment. So what do you do?
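As an aside before getting to that question, here's a toy simulation of the time-versus-snapshot point (entirely my own illustration, not the post's; the bet sizes are arbitrary assumptions): a repeated gamble that is favourable in expectation every single round, but where almost every individual path still loses over time.

```python
# A self-contained sketch of "the distribution over time isn't the snapshot
# distribution": a bet with a positive single-round expected value whose
# typical long-run outcome is nevertheless a loss. All figures are made up.
import random

random.seed(0)

def run_path(rounds: int = 50, up: float = 1.5, down: float = 0.6) -> float:
    """Multiply wealth by `up` or `down` with equal probability each round."""
    wealth = 1.0
    for _ in range(rounds):
        wealth *= up if random.random() < 0.5 else down
    return wealth

paths = sorted(run_path() for _ in range(10_000))

# Single-round expectation is 0.5 * 1.5 + 0.5 * 0.6 = 1.05 > 1, so the
# theoretical ensemble mean grows like 1.05 ** rounds...
print("sample ensemble mean:", sum(paths) / len(paths))
# ...but the typical path shrinks, because the growth factor along any one
# path is roughly sqrt(1.5 * 0.6), about 0.95 < 1, per round.
print("median path:", paths[len(paths) // 2])
print("share of paths that lost money:", sum(w < 1.0 for w in paths) / len(paths))
```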

Obviously, you apply a margin of safety. But how? There are at least three possibilities (with a toy sketch in code just after the list):

  1. Bump up your estimates of both probability and severity of loss

  2. Bump up your estimate of one of the two parameters but not the other

  3. Leave the parameters (which are, to reiterate, your best estimates given the data) the same, but bump up the final number
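To put numbers on the three options, here's a sketch that is entirely mine; the probability, severity and size of the margin are made-up assumptions, not anything from the post or from IFRS 9.

```python
# Toy figures only; none of these numbers come from the post or any standard.
p_loss = 0.02          # best-estimate probability of loss
severity = 1_000_000   # best-estimate severity of loss
margin = 1.25          # a 25% margin of safety, applied three different ways

best_estimate = p_loss * severity                    # 20,000

option_1 = (p_loss * margin) * (severity * margin)   # bump both parameters: 31,250
option_2 = (p_loss * margin) * severity              # bump one parameter only: 25,000
option_3 = best_estimate * margin                    # bump only the final number: 25,000

print(best_estimate, option_1, option_2, option_3)
```

With a simple multiplicative margin, options 2 and 3 happen to land on the same headline number; the practical difference is that option 2 changes the probability parameter itself, which is the sort of thing that feeds into staging, while option 3 leaves the best estimates untouched.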

It’s on my mind because it’s a concrete example of the offhand remark I made on Wednesday about analytic philosophy potentially having a lot to contribute to accountancy. My intuition is that option 3 above is the one that does the least damage, and that the correct place to put margins of safety is right at the end of the process. But this is not the accounting standard! IFRS9 (for it is she) requires that “model overlays” of this sort have to be carried out in such a way as to affect the “staging” of a loan portfolio. (It would be tedious to explain in detail what staging is, but basically, it matters for accounting purposes whether something has experienced a “significant increase” in the probability of loss in the last year).

The European bank supervisors wrote about this last year, reminding me that my intuition is wrong, but also reminding me that even though option 3 is not allowed under the relevant standard, it is still the most common business practice for the banks they regulate. There is no real conclusion to this post; I just thought you might find it interesting.





‘He was going to dob on me’: ABC journalist speaks after Trump threats

One Australian minister has backed journalists’ right to ask “tough questions”, but Nationals MP Matt Canavan says the ABC’s US investigation is a waste of money.
