HTML as an Accessible Format for Papers

81 comments

·December 6, 2025

ComputerGuru

If the Unicode consortium would spend less time and effort on emoji and more on making the most common/important mathematical symbols and notations available/renderable in plain text, maybe we could move past the (LA)TeX/PDF marriage. OpenType and TrueType now (edit: for well over a decade, actually) support the necessary conditional rendering required to perform complicated rendering operations to get sequences of Unicode code points to display in the way needed (theoretically, anyway) and with fallback missing-glyph-only font family substitution support available pretty much everywhere allowing you to seamlessly display symbols not in your primary font from a fallback asset (something like Noto, with every Unicode symbol supported by design, or math-specific fonts like Cambria Math or TeX Gyre, etc), there are no technical restrictions.

I’ve actually dug into this in the past and it was never lack of technical ability that prevented them from even adding just proper superscript/subscript support before, but rather their opinion that this didn’t belong in the symbolic layer. But since emoji abuse/rely on ZWJ and modifiers left and right to display in one of a myriad of variations, there’s really no good reason not to allow the same, because 2 and the squared symbol (²) are not semantically the same (so it’s not a design choice).

An interesting (complete) tangent is that Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not under any circumstances respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in the output, regardless of whether I asked for a blanket ban on TeX/MathJax/etc or insisted that it use extended Unicode code points to substitute all math formula rendering (I primarily use LLMs via the TUI where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notations and symbols, it’s still quite easy to misread unrendered raw output by missing something if you’re not careful). I shared my experiment and results here – Gemini 3 Pro would insist on even rendering single-letter constants or variables as $k$ instead of just k (or k in markdown italics, etc) no matter how hard I asked it not to (which makes me think it may have been overfit to raw LaTeX papers, and is also an interesting argument in favor of the “VL LLMs are the more natural construct” position): https://x.com/NeoSmart/status/1995582721327071367?s=20

hannahnowxyz

Have you tried a two-pass approach? For example, where prompt #1 is "Which elliptic curves have rational parameterizations?", and then prompt #2 (perhaps to a smaller/faster model like Gemma) is "In the following text, replace all LaTeX-escaped notation with Markdown code blocks and unicode characters. For example, $F_n = F_{n - 1} + F_{n - 2}$ should be replaced with `Fₙ = Fₙ₋₁ + Fₙ₋₂`. <Response from prompt #1>". Although it's not clear how you would want more complex things to be converted.
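For the simple cases, the second pass may not even need a model. A minimal sketch in Python, assuming only inline `$...$` spans and plain `_x` / `_{...}` subscripts; the function names are made up for illustration, and anything without a Unicode subscript form is left as raw TeX:

```python
import re

# Unicode defines subscript forms for only some characters, so this map is partial.
SUBSCRIPTS = {str(d): chr(0x2080 + d) for d in range(10)}  # digits 0-9
SUBSCRIPTS.update({
    "+": "\u208a", "-": "\u208b", "=": "\u208c", "(": "\u208d", ")": "\u208e",
    "a": "\u2090", "e": "\u2091", "o": "\u2092", "x": "\u2093",
    "h": "\u2095", "k": "\u2096", "l": "\u2097", "m": "\u2098",
    "n": "\u2099", "p": "\u209a", "s": "\u209b", "t": "\u209c",
})

_SUBSCRIPT = re.compile(r"_\{([^{}]*)\}|_(\w)")   # _{n - 1} or _n
_INLINE_MATH = re.compile(r"\$([^$]+)\$")          # $...$ spans only


def _convert_subscript(match):
    """Replace one TeX subscript with Unicode subscripts when every character has one."""
    body = match.group(1) if match.group(1) is not None else match.group(2)
    body = body.replace(" ", "")
    if all(ch in SUBSCRIPTS for ch in body):
        return "".join(SUBSCRIPTS[ch] for ch in body)
    return match.group(0)  # no Unicode form available: leave the TeX as-is


def inline_math_to_unicode(text):
    """Rewrite each $...$ span as a Markdown code span with Unicode subscripts."""
    def convert_span(match):
        return "`" + _SUBSCRIPT.sub(_convert_subscript, match.group(1)) + "`"
    return _INLINE_MATH.sub(convert_span, text)


print(inline_math_to_unicode("$F_n = F_{n - 1} + F_{n - 2}$"))
```

Fractions, integrals and the like are exactly the "more complex things" this does not attempt.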

ForceBru

Is this new or somehow updated? HTML versions of papers have been available for several years now.

EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...

Tagbert

From the paper...

Why "experimental" HTML?

Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices. In addition to the technical challenges, the conversion must be both rapid and automated in order to maintain arXiv’s core service of free and fast dissemination.

ForceBru

No I mean _arXiv_ has had experimental support for generating HTML versions of papers for years now. If you visit arXiv, you'll see a lot of papers have generated HTML alongside the usual PDF, so I'm trying to understand whether the article discussed any new developments. It seems like it's not new at all

daemonologist

There are pretty often problems with figure size and with sections being too narrow or wide (for comfortable reading). The PDF versions are more consistently well-laid-out.

inglor

You're right https://github.com/arXiv/arxiv-docs/blob/develop/source/abou... this needs a 2023 tag @dang

DominikPeters

As an arXiv author who likes using complicated TeX constructions, the introduction of HTML conversion has increased my workload a lot trying to write fallback macros that render okay after conversion. The conversion is super slow and there is no way to faithfully simulate it locally. Still I think it's a great thing to do.

xworld21

I believe dginev's Docker image https://github.com/dginev/ar5ivist is very close to what runs on arXiv and can be run locally. It uses a recent LaTeXML snapshot from September.

percentcer

Dumb question but what stops browsers from rendering TeX directly (aside from the work to implement it)? I assume it's more than just the rendering

bo1024

You mean a display engine that works like an HTML renderer, except starting from TeX source instead of HTML source? I think you could get something that mostly works, but it would be a pain and at the end you wouldn't have CSS or javascript, so I don't think browser makers are interested.

pwdisswordfishy

For starters, TeX is Turing-complete, and the tokenizer is arbitrarily reprogrammable at runtime.

ekjhgkejhgk

I wish epub was more common for papers. I have no idea if there's any real difficulties with that, or just not enough demand.

mmooss

epub is html, under the hood

Is there an epub reader that can format text approximately as usably and beautifully as pdf? What I've seen makes it noticeably harder to read longer texts, though I haven't looked around much.

epub also lacks annotation, or at least annotation that will be readable across platforms and time.

hombre_fatal

Because what makes epub a format on top of html is just that someone QA'ed it and wrote the html/css with it in mind. Especially considering things like diagrams and tables.

Not really what you want researchers to waste their time doing.

But you can use any of the numerous html->epub packagers yourself.

pspeter3

Why epub? Isn’t it just HTML under the hood?
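A rough sketch of what "HTML under the hood" means in practice: an EPUB is a ZIP container whose first entry is a `mimetype` file containing `application/epub+zip`, with (X)HTML content documents inside. The toy archive below is deliberately incomplete (placeholder `container.xml`, no package document), so it is not a valid EPUB; `build_toy_epub`, `list_content_documents`, and the file names are made up for illustration.

```python
import io
import zipfile


def build_toy_epub():
    """Build an in-memory ZIP laid out like an EPUB container (NOT a valid EPUB)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", "<container/>")  # placeholder content
        zf.writestr("OEBPS/paper.xhtml", "<html><body><p>Hello</p></body></html>")
    return buf.getvalue()


def list_content_documents(epub_bytes):
    """Return the names of the (X)HTML files inside an EPUB-style archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        mimetype = zf.read("mimetype").decode("ascii").strip()
        if mimetype != "application/epub+zip":
            raise ValueError(f"not an EPUB container: {mimetype!r}")
        return [name for name in zf.namelist() if name.endswith((".xhtml", ".html"))]


print(list_content_documents(build_toy_epub()))
```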

ekjhgkejhgk

Because I can open it on my ereader.

el3ctron

Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.

lalithaar

Hello, I was going through the HTML versions of my preprints on arXiv. Thank you for all that you guys do. Please let me know if the community can contribute to this in any way.

leobg

It must have been around 1998. I was editor of our school’s newspaper. We were using Corel Draw. At some point, I proposed that we start using HTML instead. In the end, we decided against it, and the reasons were the same that you can read here in the comments now.

Barbing

>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.

Challenging. Good work!

teddy-smith

It's extremely easy to convert HTML/CSS to a PDF with the print to PDF feature of the browser.

All papers should be in HTML/CSS or TeX, then simply converted to PDF.

Why are we even talking about this?

tefkah

What are you talking about? No one’s writing their paper in HTML.

The problem is having the submissions be in TeX and converting that to HTML, when the only output has been PDF for so long.

The problem isn’t converting HTML to PDF, it’s making a giant portion of TeX/PDF-only papers available in HTML.

If you’re arguing that maybe TeX then shouldn’t be the source format for papers then I agree, but other than Typst (which also isn’t perfect about HTML output yet) there aren’t that many widely accepted/used authoring formats for physics/math papers, which is what arXiv primarily hosts.

ekjhgkejhgk

LOL what. You're either trolling, or you've never written a paper in your life.

nkrisc

So, uh, where do the HTML versions of the papers come from?

carlosjobim

Except you can't have page breaks, three links in a row, anchor links.

benatkin

It's easy to convert PDF to HTML/CSS, with similar results.

Either way it gets shoehorned.

sega_sai

Unfortunately I didn't see the recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they'd make some kind of 'try html' button for those.

sundarurfriend

Do the older papers work via [Ar5iv](https://ar5iv.labs.arxiv.org/)?

> View any arXiv article URL [in HTML] by changing the X to a 5

The line

> Sources upto the end of November 2025.

sounds to me like this is indeed intended for older articles.
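The "change the X to a 5" rule quoted above amounts to a hostname swap. A minimal sketch, assuming the short `ar5iv.org` host that the rule implies (the link above uses `ar5iv.labs.arxiv.org`); `to_ar5iv` is a made-up helper and the paper ID is a placeholder, not a real reference:

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url):
    """Rewrite an arxiv.org URL to the ar5iv host, keeping path and query intact."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host != "arxiv.org":
        raise ValueError(f"not an arxiv.org URL: {url!r}")
    # Assumption: ar5iv mirrors arXiv's path layout, per the quoted rule.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


print(to_ar5iv("https://arxiv.org/abs/0000.00000"))  # placeholder ID
```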

jas39

Pandoc can convert to svg. It can then be inlined in html. Looks just like latex, though copy/paste isn't very useful

stephenlf

That doesn’t solve the accessibility issue, though. You need semantic tags.

billconan

I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.

the actual paper content format should be separated from its rendering.

i.e. it should contain abstract, sections, equations, figures, citations etc. but it shouldn't have font sizes, layout etc.

the viewer platforms then should be able to style the content differently.

cluckindan

HTML alone is in fact not a format for displaying/rendering. Done properly, it is a structural representation of the content. (This is often called "semantic HTML".)

They are converting to HTML to make the content more accessible. Accessibility in this context means a11y; in effect, "more accessible" equates to "more compatible with screen readers".

While PDF documents can be made accessible, it is way easier to do it in HTML, where browsers build an actual AOM (accessibility object model) tree and expose it to screen readers.

>it should contain abstract, sections, equations, figures, citations etc.

So <article>, <section>, <math>, <figure>, <cite>, etc.
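A rough sketch of what that buys a consumer: with semantic tags, the document outline can be read straight off the element names, with no class-name guessing. This uses Python's standard `html.parser`; the `STRUCTURAL` tag set and the sample markup are my own choices for illustration, not something arXiv's converter is known to emit.

```python
from html.parser import HTMLParser

# Tags treated as structural for this sketch (an assumption, not a standard list).
STRUCTURAL = {"article", "section", "figure", "math", "cite",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Record (nesting depth, tag) for each structural element, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL:
            self.outline.append((self.depth, tag))
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in STRUCTURAL and self.depth:
            self.depth -= 1


def outline_of(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


sample = ("<article><h1>Title</h1><section><h2>Intro</h2>"
          "<figure><img src='plot.png'></figure></section></article>")
for depth, tag in outline_of(sample):
    print("  " * depth + tag)
```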

benatkin

Much of it is a structural representation of how to display the content.

m-schuetz

That's a purist stance that's never going to work out in practice. Authors will always want to adjust the presentation of content, and HTML might be even better suited for that than LaTeX, which is bad at both.

dimal

Perfect is the enemy of good. HTML is good enough. Let’s get this done.

And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.

billconan

mixing rendering definitions with content (PDF) is something from the printer era, that is unsuitable for the digital era.

HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn't need.

for research papers, since they share the same structure, we can further separate content from rendering.

for example, if you want to later connect a paper with an AI, do you want to send <div class="abstract"> ... ?

or do some nasty heuristic to extract the abstract? like document.getElementsByClassName("abstract")[0] ?
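For reference, the class-name heuristic looks roughly like this outside a browser. A minimal sketch with Python's standard `html.parser`; `extract_by_class` is a made-up helper, and it assumes reasonably well-formed markup where the abstract really does live in an element with `class="abstract"`:

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # > 0 while inside the matching element
        self.done = False    # set once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID or self.done:
            return
        if self.depth:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_by_class(html, class_name):
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())  # collapse whitespace


page = '<article><div class="abstract"><p>We study <em>things</em>.</p></div><p>Body</p></article>'
print(extract_by_class(page, "abstract"))
```

It works, but only for as long as every producer keeps using the same class name, which is the fragility being complained about.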

simonw

All of the interesting LLMs can handle a full paper these days without any trouble at all. I don't think it's worth spending much time optimizing for that use-case any more - that was much more important two years ago when most models topped out at 4,000 or 8,000 tokens.

bob1029

> HTML is better than PDF

I disagree. PDF is the most desirable format for printed media and its analogues. Any time I plan to seriously entertain a paper from arXiv, I print it out first. I prefer to have the author's original intent in hand. Arbitrary page breaks and layout shifts that are a result of my specific hardware/software configuration are not desirable to me in this context of use.

ACCount37

I agree that PDF is best for things that are meant to be printed, no questions. But I wonder how common actually printing those papers is?

In research and in embedded hardware both, I've met some people who had entire stacks of papers printed out - research papers or datasheets or application notes - but also people who had 3 monitors and 64GB of RAM and all the papers open as browser tabs.

I'm far closer to the latter myself. Is this a "generational split" thing?

pfortuny

Possibly, but then again, when I need to study a paper, I print it; when I just need to skim it and use a result from it, I am more likely to read it on a screen (tablet/monitor). That is the difference for me.

s0rce

I used to print papers, probably stopped about 10 years ago. I now read everything in Zotero where I can highlight and save my annotations and sync my library between devices. You can also seamlessly archive html and pdfs. I don't see people printing papers in my workplace that often unless you need to read them in a wet lab where the computer is not convenient.

afavour

Wouldn’t that be CSS?

billconan

    <div class="abstract">
      abstract text ...
    </div>
    <div class="authors">
      <ol>
        <li>author one</li>
        <li>author two</li>
      </ol>
    </div>

should be just:

    [abstract]
    abstract text

    [authors]
    author one | email | affiliation
    author two | email | affiliation
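To make the idea concrete, here is a small sketch of a reader for that bracketed layout. The format is hypothetical (it exists only in this comment), so the rules here are assumptions: `[name]` on its own line opens a section, and author lines are `|`-separated as name, email, affiliation.

```python
def parse_paper(text):
    """Parse the hypothetical [section] / pipe-separated layout into a dict."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]          # e.g. "[abstract]" -> "abstract"
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    fields = ("name", "email", "affiliation")
    return {
        "abstract": " ".join(sections.get("abstract", [])),
        "authors": [
            dict(zip(fields, (part.strip() for part in line.split("|"))))
            for line in sections.get("authors", [])
        ],
    }


example = """
[abstract]
abstract text

[authors]
author one | one@example.org | Example University
author two | two@example.org | Example Institute
"""
print(parse_paper(example))
```

A viewer could then render the resulting dictionary however it likes, which is the separation of content from rendering argued for above.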

afavour

Sounds like XML and XSL would be a great fit here. Shame it’s being deprecated.

But you could still use HTML. Element names containing a dash are reserved for custom elements (that is, a new standardised element will never take such a name) so you could do:

    <paper-author-list>
      <paper-author></paper-author>
    </paper-author-list>

And it would be valid HTML. Then you’d style it with CSS, with

    paper-author {
      display: list-item;
    }

And so on.
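If you wanted to lint such names, a simplified check is short. This sketch covers only the ASCII subset of the rule (lowercase letter first, at least one dash, no uppercase) plus the handful of hyphenated names the HTML standard reserves because they already exist in SVG and MathML; `is_custom_element_name` is a made-up helper, and the full standard allows more characters than this.

```python
import re

# Hyphenated names excluded from custom elements in the HTML standard.
RESERVED = frozenset({
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
})

# ASCII-only simplification: lowercase letter, then name characters, with a dash somewhere.
_NAME = re.compile(r"[a-z][a-z0-9._]*-[a-z0-9._-]*")


def is_custom_element_name(name):
    """Return True if `name` passes this simplified custom-element-name check."""
    return bool(_NAME.fullmatch(name)) and name not in RESERVED


for candidate in ("paper-author", "paper-author-list", "abstract", "font-face"):
    print(candidate, is_custom_element_name(candidate))
```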

panzi

There are <article>, <section>, <figure>, and <legend>, but yes, <abstract> and <authors> are missing as such. But there are meta tags for such things. Then there is RDF and Thing. Not quite the same, I know, but it's not completely useless.
