
Character Prefix Conditioning


5 comments · January 8, 2025

kcarnold

This was the subject of https://arxiv.org/abs/2412.03719. (I suspect you can do something simpler than the paper's solution if you're only interested in the top-k.)
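
For illustration, here is a minimal sketch of that kind of top-k shortcut, assuming you already have next-token logits and an id-to-string vocabulary (both placeholder names; this is not the paper's exact algorithm): mask the logits to tokens consistent with the characters typed past the last full token boundary, renormalize, and keep the k best.

    import math

    def topk_with_char_prefix(logits, id_to_str, remainder, k=5):
        # Keep only tokens consistent with `remainder`, the characters
        # typed past the last complete token boundary.
        consistent = [i for i, s in id_to_str.items()
                      if s.startswith(remainder) or remainder.startswith(s)]
        # Softmax restricted to the surviving tokens.
        m = max(logits[i] for i in consistent)
        weights = {i: math.exp(logits[i] - m) for i in consistent}
        z = sum(weights.values())
        return sorted(((i, w / z) for i, w in weights.items()),
                      key=lambda pair: -pair[1])[:k]

With remainder "func", for example, this keeps tokens like "fun", "func", and "function" and drops everything else.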

A related topic is "token healing", although some implementations (unfortunately including the one in HuggingFace Transformers) make some big assumptions that aren't always true (like treating spaces as special).
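
For readers unfamiliar with the term: the usual token-healing trick is to back the prompt up by one token and constrain generation to re-cover the removed text. A rough sketch over a plain id-to-string vocabulary (placeholder names, not the HuggingFace implementation):

    def heal_last_token(prompt_ids, id_to_str):
        # Drop the prompt's final token; the next generated token(s)
        # must reproduce the removed text.
        removed = id_to_str[prompt_ids[-1]]
        return prompt_ids[:-1], removed

    def allowed_first_tokens(removed, id_to_str):
        # Tokens whose string starts with the removed text, with no
        # special-casing of leading spaces.
        return [i for i, s in id_to_str.items() if s.startswith(removed)]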

yorwba

Ideally you'd have a language model that can predict a good continuation after any byte. If an existing model can't do that because it's too reliant on a specific tokenization, you might nonetheless be able to fine-tune it until it can gracefully handle the unexpected tokenizations that result from splitting at a random byte.

kevmo314

Such a model will always be less performant than one trained on tokens, since you're effectively switching to one byte per token. Solving this problem in code is much cheaper.

yorwba

I don't mean switching to one byte per token, but switching to training on the token distribution that results from cutting off the input at arbitrary bytes. The bytes per token should be basically unchanged, as only the end gets a bit shorter.
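
A rough sketch of how such fine-tuning pairs could be built (tiktoken is used purely for illustration; any subword tokenizer would do, and the byte cut is nudged back to a UTF-8 boundary so the prefix still decodes):

    import random
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def make_example(text):
        # Cut the text at a random byte so the prefix ends with whatever
        # unusual tokenization the cut produces.
        raw = text.encode("utf-8")
        cut = random.randrange(1, len(raw))
        while cut > 0 and (raw[cut] & 0xC0) == 0x80:
            cut -= 1  # back off to a valid UTF-8 character boundary
        prefix_ids = enc.encode(raw[:cut].decode("utf-8"))
        suffix_ids = enc.encode(raw[cut:].decode("utf-8"))
        # Fine-tuning would maximize the likelihood of suffix_ids
        # given prefix_ids.
        return prefix_ids, suffix_ids

Only the tokens right at the cut differ from the normal tokenization, which matches the point about bytes per token staying basically unchanged.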

teaearlgraycold

Not sure if this is free labor or a means to source candidates.