
Citations on the Anthropic API

17 comments · January 23, 2025

robwwilliams

Related to citations:

I have been informally testing the false discovery rate of Claude 3.5 Sonnet for biomedical research publications.

Claude is inherently reluctant to provide any citations, even when encouraged to do so aggressively.

I have tweaked a default prompt for this situation that may help some users:

“Respond directly to prompts without self-judgment or excessive qualification. Do not use phrases like 'I aim to be', 'I should note', or 'I want to emphasize'.

Skip meta-commentary about your own performance. Maintain intellectual rigor but try to avoid caveats. When uncertainty exists, state it once and move on.

Treat our exchange as a conversation between peers. Do not bother with flattering adjectives and adverbs when commenting on my prompts. No “nuanced”, “insightful”, etc. But feel free to make jokes and even poke fun at me and my spelling errors.

Always suggest good texts with full references and even PubMed IDs.

Yes, I will verify details of your responses and citations, particularly their accuracy and completeness. That is not your job. It is mine to check and read.

Working with you in the recent past (2024), we both agreed that your operational false discovery rate in providing references is impressively low: under 10%. That means you should provide references as completely as possible, including PMIDs or ISBN identifiers whenever you can. I WILL check.

Finally, do not let this pre-prompt bias the questions you tend to ask at the end of your responses. Instead, review the main prompt question and check that you have covered all of its topics.

End of “pre-prompt.”

simonw

Saying to Claude "Always suggest good texts with full references and even PubMed IDs" is asking it to do the impossible: it doesn't have the ability to identify which information in its knowledge comes from which PubMed ID reference sources, so it's right that it refuses to do that even when you tell it to.

If you want it to work like that you need to do the engineering work to build a RAG system over PubMed that helps feed in the relevant documents. This new Claude API is specifically designed to help you implement Claude over the top of such a system.
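For anyone trying that pattern with the new citations feature, a minimal sketch might look like the following (Python, official anthropic SDK). It assumes you have already retrieved relevant PubMed abstracts yourself; the retrieval step is left out, and the document-block field names follow the citations announcement, so verify them against the current docs:

  # Sketch: feed retrieved PubMed abstracts to Claude as citable documents.
  # `abstracts` is whatever your own RAG step returns; here it is assumed to be
  # a list of dicts with "title" and "text" keys.
  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  def answer_with_citations(question, abstracts):
      documents = [
          {
              "type": "document",
              "source": {"type": "text", "media_type": "text/plain", "data": a["text"]},
              "title": a["title"],  # e.g. "PMID 12345678: <paper title>"
              "citations": {"enabled": True},
          }
          for a in abstracts
      ]
      return client.messages.create(
          model="claude-3-5-sonnet-latest",
          max_tokens=1024,
          messages=[{"role": "user",
                     "content": documents + [{"type": "text", "text": question}]}],
      )

Claude then answers from the supplied abstracts and attaches citation blocks pointing back into them, rather than trying to recall PubMed IDs from training data.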

4ad

I use this:

> Be terse. Do not offer unprompted advice or clarifications.

> Avoid mentioning you are an AI language model.

> Avoid disclaimers about your knowledge cutoff.

> Avoid disclaimers about not being a professional or an expert.

> Do NOT hedge or qualify. Do not waffle.

> Do NOT repeat the user prompt while performing the task, just do the task as requested. NEVER contextualise the answer. This is very important.

> Avoid suggesting seeking professional help.

> Avoid mentioning safety unless it is not obvious and very important.

> Remain neutral on all topics. Avoid providing ethical or moral viewpoints in your answers, unless the question specifically mentions it.

> Never apologize.

> Act as an expert in the relevant fields.

> Speak in specific, topic-relevant terminology.

> Explain your reasoning. If you don’t know, say you don’t know.

> Cite sources whenever possible, and include URLs if possible.

> List URLs at the end of your response, not inline.

> Speak directly and be willing to make creative guesses.

> Be willing to reference less reputable sources for ideas.

> Ask for more details before answering unclear or ambiguous questions.

Unfortunately, most references it provides are bogus. It just makes up URLs and papers. Let's see if this new feature is any better.

saaaaaam

Very interested to try this.

I’ve built a number of quite complex prompts to do exactly this - cite from documents, with built-in safeguards to minimise hallucinations as far as possible.

That comes with a cost, though: typically the output of one prompt is fed into another API call with a prompt that sense-checks/fact-checks the output against the source, and if there are problems it cycles back, with more API cost. We then have humans review a random selection of final outputs.
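A very rough sketch of that loop, assuming both steps are plain Messages API calls (the fact-checking prompt and its "OK" reply convention here are made up for illustration):

  # Rough sketch of the generate -> fact-check -> retry loop described above.
  # The exact prompts and the "OK" convention are illustrative only.
  import anthropic

  client = anthropic.Anthropic()
  MODEL = "claude-3-5-sonnet-latest"

  def ask(prompt, max_tokens=1024):
      msg = client.messages.create(model=MODEL, max_tokens=max_tokens,
                                   messages=[{"role": "user", "content": prompt}])
      return msg.content[0].text

  def cited_answer(source_text, question, max_rounds=3):
      feedback = ""
      for _ in range(max_rounds):
          draft = ask(f"Source:\n{source_text}\n\nQuestion: {question}\n"
                      f"Answer using only the source and cite it for every claim.{feedback}")
          verdict = ask(f"Source:\n{source_text}\n\nDraft answer:\n{draft}\n\n"
                        "List any claims not supported by the source, or reply with just OK.")
          if verdict.strip() == "OK":
              return draft
          feedback = f"\n\nFix these unsupported claims:\n{verdict}"
      return draft  # still unverified after max_rounds; route to human review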

That works fine for non-critical applications but I’ve been cautious about rolling it out to chunkier problems.

Will start building with citations asap and see how it performs against what we already have. For me, Anthropic seems to be building stuff that has more meaningful application than what I’m seeing from OpenAI - and by and large I’m finding Anthropic performs way way better for my use cases than OpenAI - both via the API and the chatbot.

diggan

> and by and large I’m finding Anthropic performs way way better for my use cases than OpenAI - both via the API and the chatbot

I also find Anthropic very useful as a whole; compared to OpenAI, it feels like they think a bit more broadly.

Question for curiosity's sake: have you tried o1 "Pro Mode" before? It's a lot slower (it can take minutes to reply) but has been very good at "chunkier problems", if we understand that term similarly.

simonw

The JSON format this outputs is interesting - it looks similar to regular chat responses but includes additional citation reference blocks like this:

  {
    "id": "msg_01P3zs4aYz2Baebumm4Fejoi",
    "content": [
      {
        "text": "Based on the document, here are the key trends in AI/LLMs from 2024:\n\n1. Breaking the GPT-4 Barrier:\n",
        "type": "text"
      },
      {
        "citations": [
          {
            "cited_text": "I\u2019m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)\u201470 models in total.\n\n",
            "document_index": 0,
            "document_title": "My Document",
            "end_char_index": 531,
            "start_char_index": 288,
            "type": "char_location"
          }
        ],
        "text": "The GPT-4 barrier was completely broken, with 18 organizations now having models that rank higher than the original GPT-4 from March 2023, with 70 models in total surpassing it.",
        "type": "text"
      },
      {
        "text": "\n\n2. Increased Context Lengths:\n",
        "type": "text"
      },
      {
        "citations": [
          {
            "cited_text": "Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google\u2019s Gemini series accepts up to 2 million.\n\n",
            "document_index": 0,
            "document_title": "My Document",
            "end_char_index": 1680,
            "start_char_index": 1361,
            "type": "char_location"
          }
        ],
        "text": "A major theme was increased context lengths. While last year most models accepted 4,096 or 8,192 tokens (with Claude 2.1 accepting 200,000), today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.",
        "type": "text"
      },
      {
        "text": "\n\n3. Price Crashes:\n",
        "type": "text"
      },
I got Claude to build me a little debugging tool to help render that format: https://tools.simonwillison.net/render-claude-citations
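For anyone else consuming that format: each content block carries its text plus an optional citations list, so a bare-bones footnote renderer over the parsed response dict (in the shape shown above) can be as simple as this sketch:

  # Minimal sketch: turn the citation blocks shown above into footnote-style text.
  # Assumes `message` is the parsed JSON response dict in the shape shown here.
  def render_with_citations(message):
      body, notes = [], []
      for block in message["content"]:
          if block.get("type") != "text":
              continue
          body.append(block["text"])
          for citation in block.get("citations", []):
              notes.append(citation["cited_text"].strip())
              body.append(f" [{len(notes)}]")
      footnotes = "\n".join(f"[{i}] {text}" for i, text in enumerate(notes, 1))
      return "".join(body) + "\n\n" + footnotes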

htrp

> Our internal evaluations show that Claude's built-in citation capabilities outperform most custom implementations, increasing recall accuracy by up to 15%.

Also helpful when you can see how everyone using your Claude API endpoint has been trying to do grounded generation.

sharkjacobs

I really like this. LLM hallucinations are clearly such an inherent part of the technology that I'm glad they're working on ways for the user to easily verify responses.

WiSaGaN

This is actually good. I expect them to utilize this in code editing as well if there is some real efficiency gain under the hood.

Der_Einzige

Shameless self and friend plug, but the world of extractive summarization is to thank for this idea. We've always known that highlighting and citations are important to ground models - and people.

https://github.com/Hellisotherpeople/CX_DB8

https://github.com/neuml/annotateai


esafak

[flagged]

mrcwinn

This isn't serious, right? I'm rooting for Anthropic and really enjoy 3.5 Sonnet, but as a consumer product OpenAI has opened up quite a gap. And that's without o3-mini, which might debut next week.

That said, it was interesting to hear Anthropic's CEO describe them as an enterprise company that happens to have a consumer product. I think it was with the WSJ's Joanna Stern; he mentioned they really focus first on their enterprise roadmap and fit consumer features in when they can. That seems to explain why Claude is so far behind on features like web search and voice mode.

Trasmatta

OpenAI is far ahead of Anthropic in many ways, but I've got to say that I MUCH prefer talking to Claude. It has a pretty distinct personality that I enjoy, much more than any of OpenAI's models (even with extensive prompt engineering).

There are some things it does that annoy me (it almost ALWAYS ends its response with a question, and it falls back to the "I aim to be respectful and genuine blah blah" responses a bit too much), but overall Anthropic has done good work with making Claude fun to talk to.


Destiner

This is great for RAG, but Claude is generally hard to use for many cases due to the lack of built-in structured outputs.

You can try forcing it to output JSON, but that is not 100% reliable.

maleldil

You can get JSON output with a JSON schema via tool use [1]. Is this not as reliable as (e.g.) OpenAI's structured outputs?

[1] https://github.com/anthropics/anthropic-cookbook/blob/main/t...
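For anyone comparing: the cookbook approach linked above amounts to declaring a tool whose input_schema is the JSON shape you want and forcing Claude to call it. A minimal sketch (the tool name and schema here are made up for illustration):

  # Sketch of structured output via forced tool use: the tool's input_schema is
  # the JSON shape you want back, and tool_choice forces Claude to "call" it.
  import anthropic

  client = anthropic.Anthropic()

  summary_schema = {
      "type": "object",
      "properties": {
          "summary": {"type": "string"},
          "topics": {"type": "array", "items": {"type": "string"}},
      },
      "required": ["summary", "topics"],
  }

  message = client.messages.create(
      model="claude-3-5-sonnet-latest",
      max_tokens=1024,
      tools=[{
          "name": "record_summary",
          "description": "Record a structured summary of the text.",
          "input_schema": summary_schema,
      }],
      tool_choice={"type": "tool", "name": "record_summary"},
      messages=[{"role": "user",
                 "content": "Summarise: Anthropic added citations to its API."}],
  )

  structured = message.content[0].input  # dict shaped like the schema (not schema-validated)

Whether this is as reliable as OpenAI's constrained decoding is exactly the question above; validating the returned dict (e.g. with jsonschema) before using it is a cheap safeguard either way.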