Slightly better named character reference tokenization than Chrome, Safari, FF
7 comments
· June 27, 2025
masfuerte
squeek502
Good catch, that does indeed look like a mistake in the spec. Everything past the first sentence of that error description is suspect, honestly (seems like it was naively adapted from the example in [1] but that example isn't relevant to the missing-semicolon-after-character-reference error).
Will submit an issue/PR to correct it when I get a chance.
[1] https://html.spec.whatwg.org/multipage/parsing.html#named-ch...
Ndymium
Thanks to your article I just realised my HTML entity codec library doesn't support decoding those named entities that can omit the semicolon at the end. More work for me, good thing my summer vacation just started! :)
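For anyone in the same spot, here is a minimal sketch of the missing piece in Python. It is not a description of any particular library. It assumes the table in the standard library's html.entities.html5, where the names that may omit the semicolon appear both with and without it. The names LEGACY, MAX_LEN and match_named are hypothetical, made up for illustration. The sketch also ignores the separate rule for references inside attribute values.

```python
from html.entities import html5  # keys look like "notin;", plus "not" for legacy names

# Names the table lists without a trailing semicolon (the legacy set).
LEGACY = {name for name in html5 if not name.endswith(";")}

# Longest key in the table, so we never test prefixes longer than any name.
MAX_LEN = max(len(name) for name in html5)


def match_named(after_amp):
    """Match the longest known name at the start of the text following '&'.

    Returns (replacement, consumed_length), or None if nothing matches.
    Hypothetical helper; text context only, no attribute-value handling.
    """
    for n in range(min(len(after_amp), MAX_LEN), 0, -1):
        candidate = after_amp[:n]
        if candidate in html5:
            return html5[candidate], n
    return None
```

With this, match_named("notin") matches only "not" and returns ("¬", 3), leaving "in" as ordinary text, while match_named("notin;") consumes all six characters and returns "∉".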
deepdarkforest
This might not get a lot of traction because it's very technical, but I wanted to say a massive well done for the effort. 20k words on anything this specific is not a joke. I wish I could put this level of commitment into anything in life; this was inspiring if nothing else.
squeek502
Appreciate it (I'm the author). I'd like to think there's a good bit of interesting stuff in here outside of the specific topic of named character reference tokenization.
chaps
"no[t] a 'data structures' person"
says the person who wrote an extremely technical 20k word blog post on data structures! <3
arthurcolle
Congratulations on your newfound promotion to data structures person btw
That was a good read. I reread the relevant section of the HTML5 spec and noticed an error in an example:
> For example, &not;in will be parsed as "¬in" whereas &notin will be parsed as "∉".
Only a small minority of the named character references are permitted without a closing semicolon, and notin is not one of them. So &notin is actually parsed as "¬in". &notin; is parsed as "∉".
https://html.spec.whatwg.org/#parse-error-missing-semicolon-...
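A quick way to check this, assuming Python's html.unescape follows the HTML5 rules as its documentation says. It decodes plain text only and knows nothing about attribute context, so treat it as a sanity check rather than a full tokenizer. The results dict is just a name for this example.

```python
import html

# The three spellings discussed above, decoded as text content.
inputs = ["&not;in", "&notin", "&notin;"]
results = {s: html.unescape(s) for s in inputs}

for source, decoded in results.items():
    print(f"{source!r} -> {decoded!r}")

# '&not;in' -> '¬in'   ("not;" matches, "in" is plain text)
# '&notin'  -> '¬in'   (no semicolon, so only the legacy name "not" matches)
# '&notin;' -> '∉'     (full name with semicolon)
```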