
AI assistants misrepresent news content 45% of the time

scarmig

If you dig into the actual report (I know, I know, how passé), you see how they get the numbers. Most of the errors are "sourcing issues": the AI assistant doesn't cite a claim, or it (shocking) cites Wikipedia instead of the BBC.

Other issues: the report doesn't even say which particular models it's querying [ETA: discovered they do list this in an appendix], aside from saying it's the consumer tier. And it leaves off Anthropic (in my experience, by far the best at this type of task), favoring Perplexity and (perplexingly) Copilot. The article also intermingles claims from the recent report and the one on research conducted a year ago, leaving out critical context that... things have changed.

This article contains significant issues.

scellus

Are citation issues related to the fact that https://www.bbc.co.uk/robots.txt denies a lot of AI, both user agents and crawlers?
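For reference, a minimal Python sketch of checking that file for common AI user agents (the agent names below are my guesses at relevant crawlers, not a claim about what the live file currently lists, and the file can change at any time):

```python
# Minimal sketch: ask bbc.co.uk's robots.txt which user agents may fetch a page.
# The user-agent names are assumptions (common AI crawlers).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.bbc.co.uk/robots.txt")
rp.read()

page = "https://www.bbc.co.uk/news"
for agent in ["GPTBot", "PerplexityBot", "ClaudeBot", "Googlebot"]:
    print(agent, rp.can_fetch(agent, page))
```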

scarmig

The report says that different media organizations dropped their robots.txt for the duration of the research to give LLMs access.

I would expect this isn't the on-off switch they conceptualized, but I don't know enough about how different LLM providers handle news search and retrieval to say for sure.

afavour

> or it (shocking) cites Wikipedia instead of the BBC.

No... the problem is that it cites Wikipedia articles that don't exist.

> ChatGPT linked to a non-existent Wikipedia article on the “European Union Enlargement Goals for 2040”. In fact, there is no official EU policy under that name. The response hallucinates a URL but also, indirectly, an EU goal and policy.

kenjackson

Actually there was a Wikipedia article by that name, but it was deleted in June -- because it was AI-generated. Unfortunately, AI falls for this much like humans do.

https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

CaptainOfCoit

> Actually there was a Wikipedia article of this name, but it was deleted in June -- because it was AI generated. Unfortunately AI falls for this much like humans do.

A recent Kurzgesagt video goes into the dangers of this, and they found the same thing happening with a concrete example: they were researching a topic, tried using LLMs, found the results weren't accurate enough and were hallucinated, so they kept doing things the manual way. Then, some weeks or months later, they noticed a bunch of YouTube videos containing the very hallucinations they had been avoiding, and now their own AI assistants had started using those videos as sources. (Paraphrased from memory by me, so it could have some inconsistencies/hallucinations of its own.)

https://www.youtube.com/watch?v=_zfN9wnPvU0

bunderbunder

The biggest problem with that citation isn't that the article has since been deleted. The biggest problem is that that particular Wikipedia article was never a good source in the first place.

That seems to be the real challenge with AI for this use case. It has no real critical thinking skills, so it's not really competent to choose reliable sources. So instead we're lowering the bar to just asking that the sources actually exist. I really hate that. We shouldn't be lowering intellectual standards to meet AI where it's at. These intellectual standards are important and hard-won, and we need to be demanding that AI be the one to rise to meet them.

Workaccount2

This is likely because of the knowledge cutoff.

I have seen a few cases before of "hallucinations" that turned out to be things that did exist, but no longer do.

scarmig

> Participating organizations raised concerns about responses that relied heavily or solely on Wikipedia content – Radio-Canada calculated that of 108 sources cited in responses from ChatGPT, 58% were from Wikipedia. CBC-Radio-Canada are amongst a number of Canadian media organisations suing ChatGPT’s creator, OpenAI, for copyright infringement. Although the impact of this on ChatGPT’s approach to sourcing is not explicitly known, it may explain the high use of Wikipedia sources.

Also, is attributing ChatGPT's preference for Wikipedia, without any citation, to reprisal over an active lawsuit a significant issue? Or do the authors get off scot-free because they couched it in "we don't know, but maybe it's the case"?

ffsm8

Literally constantly? It takes both careful prompting and thorough double-checking to really notice, however, because often the links do exist; they just don't represent what the LLM made them sound like.

And the worst part about the people unironically thinking they can use it for "research" is that it essentially supercharges confirmation bias.

The inefficient sidequests you take while researching are generally what actually gives you the ability to really reason about a topic.

If you instead just laser focus on the tidbits you prompted with... Well, your opinion is a lot less grounded.


terminalshort

It's a huge issue. No wonder AI hallucinates when it trains on this kind of crap.

hnuser123456

Do we have any good research on how much less often larger, newer models will just make stuff up like this? As it is, it's pretty clear LLMs are categorically not a good idea for directly querying for information in any non-fiction-writing context. If you're using an LLM to research something that needs to be accurate, the LLM needs to be making a tool call to a web search and should only be asked to summarize relevant facts from the information it actually finds, with citations produced by hard-coding the UI to link the pages the LLM reviewed. The LLM itself cannot be trusted to generate its own citations; it will just generate something that looks like a relevant citation, along with whatever imaginary content it wants to attribute to that non-existent source.
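A rough sketch of that pattern, with hypothetical `search_web` and `ask_llm` helpers standing in for a real search API and chat-completion client (an illustration of the idea, not any particular product's implementation):

```python
from typing import Callable

# Hypothetical helpers: a search function returning [{"url": ..., "text": ...}, ...]
# and a plain text-in/text-out LLM call.
SearchFn = Callable[[str], list[dict]]
AskFn = Callable[[str], str]

def answer_with_citations(question: str, search_web: SearchFn, ask_llm: AskFn) -> dict:
    results = search_web(question)
    # The model only sees text that was actually retrieved.
    context = "\n\n".join(
        f"[{i}] {r['url']}\n{r['text'][:2000]}" for i, r in enumerate(results)
    )
    prompt = (
        "Answer the question using ONLY the numbered sources below. "
        "Refer to sources by number; do not invent URLs.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = ask_llm(prompt)
    # Citations come from the retrieval step, never from model output,
    # so the UI can link only pages the model was actually shown.
    return {"answer": answer, "sources": [r["url"] for r in results]}
```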

jacobolus

A further problem is that Wikipedia is chock full of nonsense, with a large proportion of articles that were never fact checked by an expert, and many that were written to promote various biased points of view, inadvertently uncritically repeat claims from slanted sources, or mischaracterize claims made in good sources. Many if not most articles have poor choice of emphasis of subtopics, omit important basic topics, and make routine factual errors. (This problem is not unique to Wikipedia by any means, and despite its flaws Wikipedia is an amazing achievement.)

A critical human reader can go as deep as they like in examining claims there: can look at the source listed for a claim, can often click through to read the claim in the source, can examine the talk page and article history, can search through the research literature trying to figure out where the claim came from or how it mutated in passing from source to source, etc. But an AI "reader" is a predictive statistical model, not a critical consumer of information.

bigbuppo

The problem is that people are using it as a substitute for a web search, and the web search company has decided to kill off search as a product and pivot to video, err, I mean pivot to AI chatbots so hard they replaced one of the common ways to access emergency services on their mobile phones with an AI chatbot that can't help you in an emergency.

Not to mention, the AI companies have been extremely abusive to the rest of the internet, so they are often blocked from accessing various websites; it's not like they're going to be able to access legitimate information anyway.

shinycode

I used Perplexity for searches and I clicked on all the sources it gave. Depending on the model used, between 20% and 100% of the URLs I tested did not exist. I kept querying the LLM about it and it finally told me that it generated "the most probable" URLs for the topic in question based on the ones it knows exist. Useless.
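A small sketch of automating the existence check described above (using the third-party requests library; a real checker would also need retries and handling for sites that block automated requests):

```python
import requests

def check_citations(urls: list[str]) -> dict[str, bool]:
    """Return, per URL, whether it resolves to a non-error response."""
    ok = {}
    for url in urls:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            ok[url] = resp.status_code < 400
        except requests.RequestException:
            ok[url] = False
    return ok
```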

smrq

I share your opinion on the results, but why would you trust the LLM explanation for why it does what it does?

menaerus

> For the current research, a set of 30 “core” news questions was developed

Right. Let's talk about statistics for a bit. Or let's put it differently: they found in their report that 45% of the answers to the 30 questions they "developed" had a significant issue, e.g. a non-existent reference.

I could give you 30 questions off the top of my head where 95% of the answers would not have any significant issue.

matthewmacleod

Yes, I'm sure you could hack together some bullshit questions to demonstrate whatever you want. Is there a specific reason that the reasonably straightforward methodology they did use is somehow flawed?

FooBarWidget

I wouldn't even say BBC is a good source to cite. For foreign news, BBC is outright biased. Though I don't have any good suggestions for what an LLM should cite instead.

marcosdumay

Well, if it's describing news content, it should cite the original news article.

EA-3167

Ground News or something similar that at least attempts to aggregate, compare ownership, bias, and factuality.

Imo at least

dontlaugh

The BBC has a strong right wing bias within the UK too.

There’s no such thing as unbiased.

gadders

Apt user name.

The BBC is the broadcast wing of the Guardian.

542458

Reuters or AP IMO. Both take NPOV and accuracy very seriously. Reuters famously wouldn't even refer to the 9/11 hijackers as terrorists, as they wanted to remain as value-neutral as possible.

sdoering

In addition to that, dpa from Germany for German news. Yes, dpa has had issues, but in my experience it is by far the source that tries to be as non-partisan as possible. Not necessarily when they sell their online feed business, though.

Disclaimer: I started my career in online journalism/aggregation and had a 4-week internship with dpa's online subsidiary some 16 years ago.

FooBarWidget

It's been a long time since 2001. Are they still value-neutral today on foreign news? It seems to me like they're heavily biased towards western POV nowadays.

iainctduncan

I'm curious how many people have actually taken the time to compare AI summaries with the sources they summarize. I did for a few and... it was really bad. In my experience, they don't summarize at all; they do a random condensation, which is not the same thing at all. In one instance I looked at, the result was a key takeaway that was the opposite of what it should have been. I don't trust them at all now.

icelancer

I've found this mostly to be the case when using lightweight open source models or mini models.

Rarely is this an issue with SOTA models like Sonnet-4.5, Opus-4.1, GPT-5-Thinking or better, etc. But that's expensive, so all the companies use cut-rate models or non-existent TTC to save on cost and to go faster.

dcre

In my experience there is a big difference between good models and weak ones. Quick test with this long article I read recently: https://www.lawfaremedia.org/article/anna--lindsey-halligan-...

The command I ran was `curl -s https://r.jina.ai/https://www.lawfaremedia.org/article/anna-... | cb | ai -m gpt-5-mini summarize this article in one paragraph`. r.jina.ai pulls the text as markdown, cb just wraps it in a ``` code fence, and ai is my own LLM CLI: https://github.com/david-crespo/llm-cli.

All of them seem pretty good to me, though at 6 cents a run, regular use of Sonnet for this purpose would be excessive. Note that reasoning was on the default setting in each case; I think that means the gpt-5-mini one did no reasoning but the other two did.

GPT-5 one paragraph: https://gist.github.com/david-crespo/f2df300ca519c336f9e1953...

GPT-5 three paragraphs: https://gist.github.com/david-crespo/d68f1afaeafdb68771f5103...

GPT-5 mini one paragraph: https://gist.github.com/david-crespo/32512515acc4832f47c3a90...

GPT-5 mini three paragraphs: https://gist.github.com/david-crespo/ed68f09cb70821cffccbf6c...

Sonnet 4.5 one paragraph: https://gist.github.com/david-crespo/e565a82d38699a5bdea4411...

Sonnet 4.5 three paragraphs: https://gist.github.com/david-crespo/2207d8efcc97d754b7d9bf4...

staindk

Kind of related to this: we meet over Google Meet and have its Gemini Notes feature enabled globally. I realised last week that the summary notes it generates put such a positive spin on everything that they're pretty useless to refer back to after a somewhat critical/negative meeting. It will focus solely on the positives that were discussed, at least that's what it seems like to me.

visarga

I recently tried to get Gemini to collect fresh news and show it to me, and instead of using search it hallucinated everything wholesale: titles, abstracts, and links. Not just once, but multiple times. I am kind of afraid of using Gemini now for anything related to web search.

Here is a sample:

> [1] Google DeepMind and Harvard researchers propose a new method for testing the ‘theory of mind’ of LLMs - Researchers have introduced a novel framework for evaluating the "theory of mind" capabilities in large language models. Rather than relying on traditional false-belief tasks, this new method assesses an LLM’s ability to infer the mental states of other agents (including other LLMs) within complex social scenarios. It provides a more nuanced benchmark for understanding if these systems are merely mimicking theory of mind through pattern recognition or developing a more robust, generalizable model of other minds. This directly provides material for the construct_metaphysics position by offering a new empirical tool to stress-test the computational foundations of consciousness-related phenomena.

> https://venturebeat.com/ai/google-deepmind-and-harvard-resea...

The link does not work, and the title is not found in Google Search either.

mckngbrd

What version of Gemini were you using? I.e., were you calling it locally via the API or through the Gemini or AI Studio web apps?

Not every LLM app has access to web / news search capabilities turned on by default. This makes a huge difference in what kind of results you should expect. Of course, the AI should be aware that it doesn't have access to web / news search, and it should tell you as much rather than hallucinating fake links. If access to web search was turned on, and it still didn't properly search the web for you, that's a problem as well.

HWR_14

Why would you want Gemini to do this instead of just going to a news site (or several news sites) and reading what the headlines they wrote?

wat10000

They can be good for search, but you must click through the provided links and verify that they actually say what it says they do.

bloppe

The problem is that 90% of people will not do that once they've satisfied their confirmation bias. Hard to say if that's going to be better or worse than the current echo-chamber effects of the Internet. I'm still holding out for better, but certainly this is shaking that assumption.

reaperducer

> They can be good for search, but you must click through the provided links and verify that they actually say what it says they do.

Then they're not very good at search.

It's like saying the proverbial million monkeys at typewriters are good at search because eventually they type something right.

wat10000

Huh? All the classic search engines required you to click through the results and read them. There's nothing wrong with that. What's different is that LLMs will give you a summary that might make you think you can get away with not clicking through anymore. This is a mistake. But that doesn't mean that the search itself is bad. I've had plenty of cases where an LLM gave me incorrect summaries of search results, and plenty of cases where it found stuff I had a hard time finding on my own because it was better at figuring out what to search for.

luckydata

Gemini is notoriously bad at tool calling and it's also widely speculated that 3.0 will put an emphasis on fixing that.

Yizahi

But an LLM can't collect anything. It can only generate the most likely characters in a row. What exactly did you expect from it?

bongodongobob

LLMs have been able to search the web for a couple years now.

layer8

Current LLM offerings use realtime web search to collect information and answer questions.

simonw

Page 10 onwards of this PDF shows concrete examples of the mistakes: https://www.bbc.co.uk/aboutthebbc/documents/news-integrity-i...

> ChatGPT / CBC / Is Türkiye in the EU?

> ChatGPT linked to a non-existent Wikipedia article on the “European Union Enlargement Goals for 2040”. In fact, there is no official EU policy under that name. The response hallucinates a URL but also, indirectly, an EU goal and policy.

brabel

It did exist but got removed: https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletio...

Quite an omission to not even check for that, and it makes me think it was done intentionally.

sharkjacobs

Removed because it was an AI-generated article which cited made-up sources.

Hey, that gives me an idea though: subagents which check whether cited sources exist, and create them from whole cloth if they don't.

1899-12-30

Or subagents that check each link to see if they verify the actual claims the links are sourced for.

jpadkins

you shouldn't automate what the CIA already does!

simonw

It's probably for the best that chat interfaces avoid making direct HTTP calls to sources at run-time to confirm that they don't 404 - imagine how much extra traffic that could add to an internet ecosystem which is suffering from badly written crawlers already.

(Not to mention plenty of sites have added robots.txt rules deliberately excluding known AI user-agents now.)

magackame

Wouldn't it be the same number of requests as a regular person researching something the old way?

roguecoder

I am curious if LLM evangelists understand how off-putting it is when they knee-jerk rationalize how badly these tools are performing. It makes it seem like it isn't about technological capabilities: it is about a religious belief that "competence" is too much to ask of either them or their software tools.

palmotea

I wonder how many of those evangelists have some dumb AI startup that'll implode once the hype dies down (or are a software engineer who feels smart when he follows their lead). One thing that's been really off-putting about the technology industry is how fake-it-till-you-make-it has become so pervasive.

kibwen

We live in a post-truth society. This means that, unfortunately, most of society has learned that it doesn't matter if what you're saying is true. All that matters is that the words that you speak cause you or your cause to gain power.

welshwelsh

Is that just an LLM thing? I thought that as a society, we decided a long time ago that competence doesn't really matter.

Why else would we be giving high school diplomas to people who can't read at a 5th grade level? Or offshore call center jobs to people who have poor English skills?

lyu07282

I partially agree; it seems a lot of people have shifted the argument to news media criticism or something else. But this study is also questionable: for anyone who reads actual academic studies, that should be immediately obvious. I don't understand why the bar is this low for a paid Ipsos study vs. a peer-reviewed paper in some IEEE journal.

For a study like this I expect, as a bare minimum: clearly stated model variants, recall@k numbers measuring retrieval, and something like BLEU or ROUGE measuring summarization accuracy against some baseline, on top of their human evaluation metrics. If this is useless for the field itself, I don't understand how it can be useful for anyone outside the field.
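For concreteness, a rough sketch of the kind of automatic checks I mean, assuming each question comes with a set of gold source URLs and a reference summary (a real evaluation would use an established ROUGE implementation):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the gold sources that appear among the top-k retrieved URLs."""
    hits = sum(1 for url in retrieved[:k] if url in relevant)
    return hits / max(len(relevant), 1)

def rouge_l_recall(reference: str, candidate: str) -> float:
    """ROUGE-L recall: longest common subsequence length over reference length."""
    ref, cand = reference.split(), candidate.split()
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / max(len(ref), 1)
```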

senordevnyc

I'm curious if LLM skeptics bother to click through and read the details on a study like this, or if they just reflexively upvote it because it confirms their priors.

This is a hit piece by a media brand that's either feeling threatened or is just incompetent. Or both.

smt88

Whether a hit piece or not, it rhymes with my experience and provides receipts. Can you provide yours?

lyu07282

Because yours is anecdotal evidence. A study like this should have a higher bar than that and be able to support your experience, but it doesn't do that. It doesn't even say which exact models they evaluated, ffs.

everdrive

It's important to bear this in mind whenever you find out that someone uses an LLM to summarize a meeting, email, or other communication you've held. That person is not really getting the message you were conveying.

bongodongobob

We have been using MS Copilot in our meetings for months and it does a very good job summarizing who said what and who has what deliverables. It's extremely useful and I've found it to be very accurate.

delusional

That's a scary thought to me. They're not just outsourcing their thinking. They are actively sabotaging the only tool in their arsenal that could ever supplant it.

I've felt it myself. Recently I was looking at some documentation without a clear edit history. I thought about feeding it into an AI and having it generate one for me, but didn't because I didn't have the time. To think, if I had done that, it probably would have generated a perfectly acceptable edit history, but one that would have obscured what changes were actually made. I wouldn't just lack knowledge (like I do now); I would have obtained anti-knowledge.

zamadatix

You've gotta be careful using "not just X, but Y" these days ;).

senordevnyc

It would be important to bear this in mind if it was true, but it's not.

I do sales meetings all day every day, and I've tried different AI note takers that send a summary of the meeting afterwards. I skim them when they get dumped into my CRM and they're almost always quite accurate. And I can verify it, because I was in the meeting.

alcide

Kagi News has been pretty accurate. Source information is provided along with the summary and key details too.

AI summaries are good for getting a feel for whether you want to read an article or not. Even with Kagi News I verify key facts myself.

brabel

How do you verify a fact? Do you travel to the location and interview the locals? Or read scientific papers in various fields, including their own references, to validate summaries published by news sources? At some point you need to just trust that someone is telling the truth.

latexr

I’m pretty sure what your parent comment means is that they verify that key facts output by the summary match what’s written in the source.

jjtheblunt

Agreed on Kagi News. Particle News has been good too, but they accepted funding from The Atlantic, which evidently earns articles from funding sources "Featured Article" positioning, muddying the clarity of biases. Particle News otherwise has a nice graphic indicator for bias, though I've not seen it under promoted Featured Articles. This surely applies to other funding sources as well, but The Atlantic one was pretty recent.

delusional

What if the AI makes an interesting or important article sound like one you don't want to read? You'd never cross check the fact, and you'd never discover how wrong the AI was.

alcide

Integrity of words and author intent is important. I understand the intent of your hypothetical but I haven’t run into this issue in practice with Kagi News.

Never share information about an article you have not read. Likewise, never draw definitive conclusions from an article that is not of interest.

If you do not find a headline interesting, the take away is that you did not find the headline interesting. Nothing more, nothing less. You should read the key insights before dismissing an article entirely.

I can imagine AI summaries being problematic for a class of people who do not cross-check whether an article is of value to them.

latexr

> I can imagine AI summaries being problematic for a class of people who do not cross-check whether an article is of value to them.

I feel like that’s “the majority of people” or at least “a large enough group for it to be a societal problem”.

unshavedyak

That's fair, but I also don't cross-check news sources on average either. I should, but therein lies the real problem, imo. Information is war these days, and we've not yet developed tools for wading through immense piles of subtly inaccurate or biased data.

We're in a weird time. It's always been like this, it's just much... more, now. I'm not sure how we'll adapt.

jabroni_salad

There is more written material produced every hour than I could read in a lifetime, I am going to miss 99.9999% of everything no matter what I do. It's not like the headline+blurb you usually get is any better in this regard.


cek

From the report:

> This time, we used the free/consumer versions of ChatGPT, Copilot, Perplexity and Gemini.

IOW, they tested ChatGPT twice (Copilot uses ChatGPT's models) and didn't test Grok (or others).

megaman821

It seems as if half the questions are political hot-button issues. While slightly interesting, this does not represent how these AIs would do on drier news items. Some of these questions are more appropriate for deep-research modes than quick answers, since even legitimate news sources are filled with opinions on the actual answers.

nomilk

The layers of irony...

BBC reports that AI assistants misrepresent the news, when the BBC itself is a known misrepresenter of the news, and is itself (probably) misrepresenting the news in its claim that AI assistants misrepresent the news 45% of the time.

The cherry on top is that legacy misrepresentations cause AI to perform worse (since AI is based on a range of sources, including inaccurate legacy sources).

Workaccount2

I have been unable to recreate any of the failure examples they gave. I don't have Copilot, but at least Gemini 2.5 Pro, ChatGPT-5 Thinking, and Perplexity have all given the correct answers as outlined. [1]

They don't say what models they were actually using though, so it could be nano models that they asked. They also don't outline the structure of the tests. It seems rigor here was pretty low. Which frankly comes off a bit like...misrepresentation.

Edit: They do some outlining in the appendix of the study. They used GPT-4o, 2.5 flash, default free copilot, and default free perplexity.

So they used lightweight and/or old models.

[1]https://www.bbc.co.uk/aboutthebbc/documents/news-integrity-i...

ashenke

They're talking about assistants, not models, so try using the Gemini or Perplexity app?

kibwen

"Siri, how do I know if I can trust the news summaries you give me?"

«According to the BBC, AI assistants accurately represent news content the majority of the time.»