Branchless UTF-8 Encoding
19 comments
·January 17, 2025koala_man
I'm surprised there are no UTF-8 specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero"
jsheard
FJCVTZS isn't really as specific to Javascript as the name suggests, it actually copies the semantics of the equivalent x86 instruction, which JS took its semantics from. ARM obviously doesn't want to name-drop x86 so they call it a JS instruction instead. JS runtimes do use it but it's also useful for x86 emulation in general.
orlp
If you have access to the BMI2 instruction set I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:
branchless_utf8:
mov rax, rdi
lzcnt ecx, esi
lea rdx, [rip + .L__unnamed_1]
movzx ecx, byte ptr [rcx + rdx]
lea rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
pdep esi, esi, dword ptr [rdx + 8*rcx]
or esi, dword ptr [rdx + 8*rcx + 4]
movbe dword ptr [rdi], esi
mov qword ptr [rdi + 8], rcx
ret
The code: static DEP_AND_OR: [(u32, u32); 5] = [
(0, 0),
(0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
(0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
(0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
(0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
];
const LEN: [u8; 33] = [
// 0-10 leading zeros: not valid.
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
// 11-15 leading zeros: 4 bytes.
4, 4, 4, 4, 4,
// 16-20 leading zeros: 3 bytes.
3, 3, 3, 3, 3,
// 21-24 leading zeros: 2 bytes.
2, 2, 2, 2,
// 25-32 leading zeros: 1 byte.
1, 1, 1, 1, 1, 1, 1, 1,
];
pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
let leading_zeros = codepoint.leading_zeros() as usize;
let bytes = LEN[leading_zeros] as usize;
let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
(ret.swap_bytes().to_le_bytes(), bytes)
}
null
xeeeeeeeeeeenu
> So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”.
Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and Piledriver (AMD) introduced the LZCNT instruction that doesn't have this problem.
pklausler
Easy to count leading zeroes in a branch-free manner without a hardware instruction using a conditional move and a de Bruijn sequence; see https://github.com/llvm/llvm-project/blob/main/flang/include... .
sltkr
You can also very trivially do (codepoint | 1).leading_zeros(), then you can also shave one byte off the LEN table. (This doesn't affect the result because LEN[32] == LEN[33] == 1).
Arnavion
>So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!
Yes, RISC-V for example defines the instructions for counting leading / trailing zeros (clz, clzw, ctz, ctzw) such that an N-bit zero value has N of them.
I don't know if I can show it on Rust Godbolt because none of the default RISC-V targets that Rust has support the Zbb extension, but I checked with a custom target that I use locally for my emulator, and `leading_zeros()` indeed compiles to just one `clz` without any further branches. Here's a C demonstration though: https://gcc.godbolt.org/z/cKx3ajsjh
Validark
I wrote this stub a while back
https://gist.github.com/Validark/457b6db8aa00ded26a6681d4d25...
RenThraysk
Or-ing 1 onto codepoint before calling leading_zeroes() should get a decent compiler to remove the branch.
Dwedit
Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?
ngoldbaum
You could do two passes over the string, first get the total length in bytes, then fill it in codepoint by codepoint.
You could also pessimistically over-allocate assuming four bytes per character and then resize afterwards.
With the API in the linked blog post it's up to the user to decide how they want to use the output [u8;4] array.
lxgr
> Compiler explorer confirms that, with optimizations enabled, this function is branchless.
Only if you don't consider conditional move instructions branching/cheating :)
ThatGuyRaion
So is this potentially performance improving?.
not2b
Usually people are interested in branchless implementations for cryptography applications, to avoid timing side channels (though you then have to make sure that the generated instructions don't have different timing for different input values), and will pay some time penalty if they have to.
PhilipRoman
Last time I tested branchless UTF-8 algorithms, I came to the conclusion that they only perform [slightly] better for text consisting of foreign multibyte characters. Unless you expect lots of such inputs on the hot path, just go with traditional algorithms instead. Even in the worst case the difference isn't that big.
Sometimes people fail to appreciate how insanely fast a predictable branch really is.
jan_haker
[flagged]
Incidentally, this automatic branch-if-zero from LLVM is being improved.
First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:
https://github.com/llvm/llvm-project/pull/102885
Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).
If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSR with the same register as destination. If the BSR input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSR input is zero, then BSR leaves the register unmodified and you get 32.
Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.
Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.
(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")