Citations on the Anthropic API
65 comments
· January 23, 2025
robwwilliams
simonw
Saying to Claude "Always suggest good texts with full references and even PubMed IDs" is asking it to do the impossible: it doesn't have the ability to identify which information in its knowledge comes from which PubMed ID reference sources, so it's right that it refuses to do that even when you tell it to.
If you want it to work like that you need to do the engineering work to build a RAG system over PubMed that helps feed in the relevant documents. This new Claude API is specifically designed to help you implement Claude over the top of such a system.
robwwilliams
Have you tested this extensively yourself? I have been very surprised by success rates in my own “not famous” papers. I asked Claude to provide full citations for ten papers by Robert W Williams at the University of Tennessee in biomedical research. Nine of ten were perfect, down to page numbers. One of ten was a complete construct, but highly plausible. An FDR of 0.1 is damn impressive.
Test yourself. Here was my reference data set:
https://scholar.google.com/citations?user=OYJMYwIAAAAJ&hl=en...
Really curious what the range of FDRs is at different levels of accuracy for different fields.
BoorishBears
It's fundamentally not something anyone building a real application can rely on.
You're essentially gambling on where in the embedding space your end users are going to query: you might get lucky or you might not.
You're also relying on functionality they're actively trying to degrade during post-training (repeating training data text verbatim). Some LLM providers will even actively filter it: https://ai.google.dev/gemini-api/docs/troubleshooting?lang=p...
simonw
I'll be honest, I was surprised at how well Claude was able to describe the papers you linked to there when I prompted it directly about them. I may have to update my mental model of quite how good Claude's recall can be with respect to academic papers in those areas.
vunderba
Maybe, but there's a big difference between a known 10% failure rate versus an unknown 10% failure rate. The former allows me to instantly reject the failures and retain the good responses. The latter requires me to go through a manual check for every response.
no_wizard
Further proof AI isn't actually AI, and that we continue to move the goalposts on the proper definition of the term.
It’s all still machine learning like always and it has the same limitations.
willy_k
What’s the actual hit rate? AFAIU it’s plausible that an article title and PubMed ID could be encoded in the model's weights if they appeared in the training data.
simonw
It's impossible to know, because Anthropic (like OpenAI and others) won't confirm what's in their training data. We don't know if they've trained on PubMed, and if they DID we don't know if that training process might conceivably allow the model to associate IDs with article information.
Given that, I don't trust the models to be able to provide useful citations.
We already know how to get much more reliable citations out of a model: implement them on top of RAG, which this new Claude API can clearly help us do.
robwwilliams
For me the true positive rate is at least 80%.
4ad
I use this:
> Be terse. Do not offer unprompted advice or clarifications.
> Avoid mentioning you are an AI language model.
> Avoid disclaimers about your knowledge cutoff.
> Avoid disclaimers about not being a professional or an expert.
> Do NOT hedge or qualify. Do not waffle.
> Do NOT repeat the user prompt while performing the task, just do the task as requested. NEVER contextualise the answer. This is very important.
> Avoid suggesting seeking professional help.
> Avoid mentioning safety unless it is not obvious and very important.
> Remain neutral on all topics. Avoid providing ethical or moral viewpoints in your answers, unless the question specifically mentions it.
> Never apologize.
> Act as an expert in the relevant fields.
> Speak in specific, topic relevant terminology.
> Explain your reasoning. If you don’t know, say you don’t know.
> Cite sources whenever possible, and include URLs if possible.
> List URLs at the end of your response, not inline.
> Speak directly and be willing to make creative guesses.
> Be willing to reference less reputable sources for ideas.
> Ask for more details before answering unclear or ambiguous questions.
Unfortunately most references it provides are bogus. It just makes up URLs and papers. Let's see if this new feature is any better.
dragonwriter
This new feature is restricted to sources you provide in the context window: “With Citations, users can now add source documents to the context window, and when querying the model, Claude automatically cites claims in its output that are inferred from those sources.”
4ad
Yeah, I see that now. Completely useless.
robwwilliams
Love your pre-prompt. Better than mine.
The difference in FDRs is likely to be domain specific. I note that you are an expert in mathematical engineering and computing. In contrast my work area is neatly confined and defined by PubMed.
4ad
I must say that it works reasonably well even with imagined references and URLs because it usually gets the author right, or there is a similar paper with a reasonably close title.
I'd say around 50% of the time I get a real reference, 30% of the time I get an imagined reference but from an author who studied the very problem under question, and 20% is completely hallucinated.
saaaaaam
Very interested to try this.
I’ve built a number of quite complex prompts to do exactly this - cite from documents, with built-in safeguards to minimise hallucinations as far as possible.
That comes with a cost though - typically the output of one prompt is fed into another API call with a prompt that sense-checks/fact-checks the output against the source, and if there are problems it has to cycle back - with more API cost. We then human review a random selection of final outputs.
That works fine for non-critical applications but I’ve been cautious about rolling it out to chunkier problems.
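Roughly, the cycle looks like the sketch below (an illustrative sketch only; the prompts, helper names and model alias are placeholders, not the real pipeline):

import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
MODEL = "claude-3-5-sonnet-latest"  # placeholder model alias

def ask(prompt: str) -> str:
    msg = client.messages.create(model=MODEL, max_tokens=1024,
                                 messages=[{"role": "user", "content": prompt}])
    return msg.content[0].text

def answer_with_checking(source: str, question: str, max_cycles: int = 2) -> str:
    # first API call: draft an answer with citations from the source
    answer = ask(f"Answer using ONLY this source, citing it.\n\nSOURCE:\n{source}\n\nQUESTION: {question}")
    for _ in range(max_cycles):
        # second API call: sense-check/fact-check the draft against the source
        verdict = ask("Does every claim and citation in ANSWER appear in SOURCE? "
                      f"Reply OK, or list the problems.\n\nSOURCE:\n{source}\n\nANSWER:\n{answer}")
        if verdict.strip().upper().startswith("OK"):
            return answer
        # cycle back with the checker's objections (more API cost each time)
        answer = ask(f"Fix these problems in ANSWER, using only SOURCE:\n{verdict}\n\n"
                     f"SOURCE:\n{source}\n\nQUESTION: {question}\n\nANSWER:\n{answer}")
    return answer  # anything still failing after the retries goes to human review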
Will start building with citations asap and see how it performs against what we already have. For me, Anthropic seems to be building stuff that has more meaningful application than what I’m seeing from Open AI - and by and large I’m finding Anthropic performs way way better for my use cases than Open AI - both via the API and the chatbot.
diggan
> and by and large I’m finding Anthropic performs way way better for my use cases than Open AI - both via the API and the chatbot
I also find Anthropic very useful as a whole, they seem to think a bit broader it feels like, compared to OpenAI.
Question for curiosity's sake: have you tried o1 "Pro Mode" before? It's a lot slower (can take minutes to reply) but has been very good at "chunkier problems", if we understand that term similarly.
saaaaaam
I have but only via the chat interface. I wasn’t particularly impressed, and for my purposes I’d rather use chained prompts via the API than try for “one shot”. However that could be because I’ve not amended my prompting style extensively enough. From what I’ve read o1 pro delivers most benefits from quite a different way of prompting.
sharkjacobs
I really like this. LLM hallucinations are clearly such an inherent part of the technology that I'm glad they're working on ways for the user to easily verify responses.
htrp
> Our internal evaluations show that Claude's built-in citation capabilities outperform most custom implementations, increasing recall accuracy by up to 15%.
also helpful when you can see how everyone using your claude api endpoint has been trying to do grounded generation
Der_Einzige
Shameless self and friend plug, but the world of extractive summarization is to thank for this idea. We've always known that highlighting and citations are important to ground models - and people.
igorkraw
your profile says >Hit me up if you want to collaborate on NLP research
but doesn't hint at how; check _my_ profile for hints on how :-p
oceansweep
I've assumed that Google's approach for NotebookLM is similar to this, given their release of https://huggingface.co/google/gemma-7b-aps-it :
Gemma-APS is a generative model and a research tool for abstractive proposition segmentation (APS for short), a.k.a. claim extraction. Given a text passage, the model segments the content into the individual facts, statements, and ideas expressed in the text, and restates them in full sentences with small changes to the original text.
Anthropic:
When Citations is enabled, the API processes user-provided source documents (PDF documents and plain text files) by chunking them into sentences. These chunked sentences, along with user-provided context, are then passed to the model with the user's query.
Claude analyzes the query and generates a response that includes precise citations based on the provided chunks and context for any claims derived from the source material. Cited text will reference source documents to minimize hallucinations.
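Based on that description, a minimal request (my own sketch, not Anthropic's official example; the model alias and file name are placeholders) would look something like this in Python:

import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model alias
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": open("report.txt").read(),  # placeholder source document
                },
                "title": "My Document",
                "citations": {"enabled": True},  # turn Citations on for this document
            },
            {"type": "text", "text": "Summarize the key findings."},
        ],
    }],
)

# Each returned content block may carry a `citations` list pointing back into the document.
for block in response.content:
    print(block.text, getattr(block, "citations", None))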
rahimnathwani
This is interesting. I've been doing this using GPT-4o-mini by numbering paragraphs in the source context, and asking the model to give me a number as the citation. That:
- doesn't require me to trust the citations are reproduced faithfully, as I can retrieve them from the original using the reference number, and
- doesn't use as many output tokens as asking the model to provide the text of the citation.
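Roughly, that approach looks like this (a quick sketch; the variable names and prompt wording are illustrative, not the actual implementation):

# Number the source paragraphs, ask the model for bracketed numbers, then map
# the numbers back to the original text so citations are reproduced faithfully.
source_text = "First paragraph...\n\nSecond paragraph...\n\nThird paragraph..."
question = "What does the document say about X?"

paragraphs = source_text.split("\n\n")
numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1))

prompt = (
    "Answer using only the numbered source below. After each claim, cite the "
    "paragraph number in square brackets, e.g. [2].\n\n"
    f"{numbered}\n\nQuestion: {question}"
)

# ...send `prompt` to the model, then resolve each cited number locally:
def resolve_citation(n: int) -> str:
    return paragraphs[n - 1]  # the verbatim paragraph, straight from the source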
d4rkp4ttern
This is exactly what we have had working in Langroid for at least a year, so I don't quite get the buzz around this. Langroid's `DocChatAgent` produces granular markdown-style citations, and works with practically any (good enough) LLM. E.g. try running this example script on the DeepSeek R1 paper:
https://github.com/langroid/langroid/blob/main/examples/docq...
uv run examples/docqa/chat.py https://arxiv.org/pdf/2501.12948
Sample output here: https://gist.github.com/pchalasani/0e2e54cbc3586aba60046b621...
vrosas
> Thomson Reuters uses Claude to power their AI platform
If you're just making calls to Anthropic's API can you really call yourself a platform?
handfuloflight
All of the content and resources associated with their production is their platform.
simonw
I just published some more detailed notes on this feature here: https://simonwillison.net/2025/Jan/24/anthropics-new-citatio...
saaaaaam
Agree on the point you make about Open AI behaving more like a consumer facing company while Anthropic seems more geared to enterprise. This is exactly what I’ve been feeling for the past six months or so, and I’m getting far more value from Anthropic. This citations release solves a real problem, and while Open AI has released some impressive sounding things recently they sometimes feel like consumer fluff to drive press coverage more than meaningful features for complex use cases.
nojvek
Perplexity.ai does search citations really well. I can see Anthropic seeing value in that and building something internal.
I was skeptical about Perplexity but it has been my primary search engine for more than 6 months now.
LLMs connected to the internet with very little hallucination are valuable tech.
simonw
The JSON format this outputs is interesting - it looks similar to regular chat responses but includes additional citation reference blocks like this:
{
  "id": "msg_01P3zs4aYz2Baebumm4Fejoi",
  "content": [
    {
      "text": "Based on the document, here are the key trends in AI/LLMs from 2024:\n\n1. Breaking the GPT-4 Barrier:\n",
      "type": "text"
    },
    {
      "citations": [
        {
          "cited_text": "I\u2019m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)\u201470 models in total.\n\n",
          "document_index": 0,
          "document_title": "My Document",
          "end_char_index": 531,
          "start_char_index": 288,
          "type": "char_location"
        }
      ],
      "text": "The GPT-4 barrier was completely broken, with 18 organizations now having models that rank higher than the original GPT-4 from March 2023, with 70 models in total surpassing it.",
      "type": "text"
    },
    {
      "text": "\n\n2. Increased Context Lengths:\n",
      "type": "text"
    },
    {
      "citations": [
        {
          "cited_text": "Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google\u2019s Gemini series accepts up to 2 million.\n\n",
          "document_index": 0,
          "document_title": "My Document",
          "end_char_index": 1680,
          "start_char_index": 1361,
          "type": "char_location"
        }
      ],
      "text": "A major theme was increased context lengths. While last year most models accepted 4,096 or 8,192 tokens (with Claude 2.1 accepting 200,000), today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.",
      "type": "text"
    },
    {
      "text": "\n\n3. Price Crashes:\n",
      "type": "text"
    },
I got Claude to build me a little debugging tool to help render that format: https://tools.simonwillison.net/render-claude-citations
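For anyone who wants to walk that structure by hand, here's a quick sketch (assuming the message above has been saved to a file; the file name is a placeholder):

import json

# Assumes the API response shown above was saved to citations_response.json.
with open("citations_response.json") as f:
    message = json.load(f)

rendered = []
for block in message["content"]:
    rendered.append(block["text"])
    for c in block.get("citations", []):
        # char_location citations index back into the original source document
        rendered.append(f' [{c["document_title"]}, chars {c["start_char_index"]}-{c["end_char_index"]}]')

print("".join(rendered))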
yding
Thanks Simon. I think this might solve one of the most common questions people ask me: how do I get Perplexity-like inline citations on my LLM output?
This looks like model fine-tuning rather than after-the-fact pseudo-justification. Do you agree?
simonw
Yeah, I think they fine-tuned their model to be better at the pattern where you output citations that reference exact strings from the input. Previously that's been a prompting trick, e.g. here: https://mattyyeung.github.io/deterministic-quoting
yding
Makes sense. I wonder if it affects the model output performance (sans quotes), as I could imagine that splitting up the model output to add the quotes could cause it to lose attention on what it was saying.
jedberg
> Claude can now provide detailed references to the exact sentences and passages it uses to generate responses, leading to more verifiable, trustworthy outputs.
For now. Until it starts providing citations for AI generated content.
varispeed
Even so-called "trustworthy" sources may contain disinformation, for instance if they come from certain governments or think tanks, and AI has no way to tell whether they are true or whether they make sense.
Though a feature where AI could use all its knowledge to tell whether a source is pulling the wool over its "eyes" would be massive.
Imagine being able to instantly verify what populist politicians, or those who pretend not to be populist, are saying.
Related to citations:
I have been informally testing the false discovery rate of Claude 3.5 Sonnet for biomedical research publications.
Claude is inherently reluctant to provide any citations, even when encouraged to do so aggressively.
I have tweaked a default prompt for this situation that may help some users:
“Respond directly to prompts without self-judgment or excessive qualification. Do not use phrases like 'I aim to be', 'I should note', or 'I want to emphasize'.
Skip meta-commentary about your own performance. Maintain intellectual rigor but try to avoid caveats. When uncertainty exists, state it once and move on.
Treat our exchange as a conversation between peers. Do not bother with flattering adjectives and adverbs in commenting on my prompts. No “nuanced”, “insightful” etc. But feel free to make jokes and even poke fun at me and my spelling errors.
Always suggest good texts with full references and even PubMed IDs.
Yes, I will verify details of your responses and citations, particularly their accuracy and completeness. That is not your job. It is mine to check and read.
Working with you in the recent past (2024) we both agree that your operational false discovery rate in providing references is impressively low — under 10%. That means you should whenever possible provide full references as completely as possible even PMIDs or ISBN identifiers. I WILL check.
Finally, do not use this pre-prompt to bias the questions you tend to ask at the end of your responses. Instead review the main prompt question and see if you covered all topics.
End of “pre-prompt”.