RT64: N64 graphics renderer in emulators and native ports
47 comments
February 20, 2025 · terhechte
devit
Doesn't seem to be a noticeable improvement from N64-era graphics.
It probably needs generative AI based upscaling to high resolution meshes, textures and realistic materials to actually achieve a quality improvement.
dcrazy
The improvement is in the shadows. Link’s real-time shadow on the ground, the self-shadowing of the roof of the house that Luigi emerges from, the shadows the ropes cast on Kirby… the N64 could not do these things even at coarse resolution due to limited texture bandwidth.
basch
It would be an interesting feedback loop to capture a screenshot of a game (sm64), determine the green texture is grass, update the texture in real time, take another screenshot, and iterate, having the game flesh itself out in real time as it's being played.
kiicia
Four years ago there was a demo of exactly that in GTA5: https://www.ign.com/articles/experimental-ai-tool-makes-gta-...
yieldcrv
> Doesn't seem to be a noticeable improvement from N64-era graphics
right, because that's not the point; the point is to be like N64-era graphics at 60fps and greater, with widescreen, motion blur, and other things that have high engagement amongst gamers (whether the self-proclaimed gamers like them or not)
thecaio
thank you! so needed. there is a special place in hell for folks that don't attach screenshots to fundamentally graphic projects.
robin_reala
I guess the big test will be running Kaze Emanuar’s code through it.
01HNNWZ0MV43FF
For those who haven't heard, Kaze is the guy who optimized Mario 64's code, fixed some physics bugs, etc., claims roughly 6x speedup, so that he can make a mod / ROM-hack that runs smoother despite having better graphics: https://www.youtube.com/watch?v=t_rzYnXEQlE
bityard
Has he actually released it yet?
emchammer
Just about a Fabrice Bellard level of competence there
immibis
And discovered you could disable cache writeback to have a region of fast RAM without bus contention, and ways to use the hidden 9th bit in each byte, which is only used by certain graphics framebuffer operations.
I don't think he uses either one in serious code, but if he did, good luck emulating it.
dmonitor
He demos his code on both console and emulator, so there shouldn't be an issue in that regard.
mouse_
> Uses ubershaders to guarantee no stutters due to pipeline compilation.
I may sound way out of the loop here, but... How come this was never a problem for older dx9/dx11/GL games and emulators?
tgtweak
In those GL/DX games (built for non-specific hardware), all the textures and shaders are compiled either during the game's build OR before you get into the scene. Many console systems, particularly Nintendo's, do that precompilation specifically for the hardware GPU that is inside the console. That is not known to the emulator in advance (unless someone publishes a shader compilation alongside the ROM...), so when the shader is referenced in the scene, it needs to be compiled at runtime to work on the emulated graphics system (translated from Nintendo-hardware shader code to DirectX, Vulkan, or OpenGL, then further into the vendor-specific shader).
Most modern emulators implement a shader cache which stores those shaders as they are encountered, so that this "compilation stutter" only happens once per shader - but modern titles can have hundreds or thousands of shaders, and that means on a playthrough you're pretty much encountering it consistently. Breath of the Wild stands out as a game you basically had to run with precompiled shader caches, as it was borderline unplayable without them.
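Conceptually, that cache is just a compile-on-miss map keyed on the emulated shader state; a minimal sketch of the idea (every name here is made up):

    // Minimal sketch of a compile-on-miss shader cache; all names hypothetical.
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct CompiledShader { /* host-GPU program handle, etc. */ };

    // Stand-in for the expensive translate-and-compile step (the stutter).
    CompiledShader CompileForHostGpu(const std::string& translatedSource) { return {}; }

    class ShaderCache {
        std::unordered_map<uint64_t, CompiledShader> cache_;
    public:
        // key = hash of the emulated shader bytes / GPU configuration
        CompiledShader& Get(uint64_t key, const std::string& translatedSource) {
            auto it = cache_.find(key);
            if (it == cache_.end()) {
                // First encounter: this is where the one-time hitch happens.
                it = cache_.emplace(key, CompileForHostGpu(translatedSource)).first;
            }
            return it->second;
        }
    };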
Ubershaders act like fallback shaders - using an off-the-shelf precompiled "particle" shader vs the actual one, while the actual one is compiled for use next time - this prevents the stutter at the cost of visual fidelity. If you see an explosion in a game, it will be a generic explosion shader vs the actual one used in the game, until the real one is available in the shader cache.
zeta0134
That's not quite how ubershaders work. They're a "fallback" shader in the sense that they rather inefficiently implement the entire pipeline, but they do implement the entire pipeline. The shader being compiled in another thread will be more efficient, as it uses only the logic needed for whatever configuration the game is calling up. But the visual result is identical in the ubershaders case; that's the whole point. If you want, and your host system is powerful enough, you can turn ubershaders on all the time and disable the entire compilation thread and associated cache.
I believe the term was coined by the Dolphin team, who did a pretty good high-level writeup of the feature here: https://de.dolphin-emu.org/blog/2017/07/30/ubershaders/
tom_
It predates Dolphin's use of it, though don't ask me by how long. Here's a reference to the term from 2008: https://realtimecollisiondetection.net/blog/?p=73
derefr
So how about:
1. A global, networked shader cache — where when any instance of the emulator encounters a new shader, it compiles it, and then pushes the KV-pair (ROM hash, target platform, console shader object-code hash)=(target-platform shader object-code) into some KV server somewhere; and some async process comes along periodically to pack all so-far-submitted KV entries with a given (ROM hash, target platform) prefix into shader-cache packfiles. On first load of a game, the emulator fetches the packfile if it exists, and loads the KV pairs from it into the emulator's local KV cache.

(In theory, the emulator could also offer the option to fetch global-shader-cache-KV-store "WAL segment" files — chunks of arbitrary global-shader-cache KV writes — as they're published every 15 minutes. Or KV entries for given (ROM hash, target) prefixes could be put into message-queue topics named after those prefixes, to which running instances of the emulator could subscribe. These optimizations might be helpful when e.g. many people are playing a just-released ROMhack, where no single person has yet run through the whole game to get it in the cache. Though, mind you, the ROMhack's shaders could already be cached into the global store before release, if the ROMhacker used the emulator during development... or if they knew about this, and were considerate enough to use some tool created by the emulator dev to explicitly compile + submit their raw shader project files into the global KV store.)
2. Have the emulator (or some separate tool) "mine out" all the [statically-specified] shaders embedded in the ROM, as a one-time process. (Probably not just a binwalk, because arbitrary compression. Instead, think: a concolic execution of the ROM that looks for any call to the "load main-memory region into VRAM as shader" GPU instruction — where there is a symbolically-emulated memory with regions that have either concrete or abstract values. If the RAM region referenced in this "load as shader" instruction is statically determinable — and the memory in that region has a statically-determinable value on a given code path — then capture that RAM region.) Precompile all shaders discovered this way to create a "perfect" KV cachefile for the game. Publish this into a DHT (or just a central database) under the ROM's hash. (Think: OpenSubtitles.org)
Mind you, I think the best strategy would actually combine the two approaches — solution #2 can virtually eliminate stutter with a single pre-processing step, but it doesn't allow for caching of dynamically-procedurally-generated shaders. Solution #1 still has stutter for at least one player, one time, for each encountered shader — but it handles the case of dynamic shaders.
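For concreteness, the cache entries in approach #1 might be keyed something like this (purely a sketch of the idea, not anything an emulator actually implements):

    // Hypothetical key/value layout for the global shader cache in idea #1.
    #include <array>
    #include <cstdint>
    #include <vector>

    struct GlobalShaderKey {
        std::array<uint8_t, 32> romHash;          // identifies the game / ROMhack
        uint32_t targetPlatform;                  // host GPU / driver / API identifier
        std::array<uint8_t, 32> consoleCodeHash;  // hash of the original shader bytes
    };

    struct GlobalShaderValue {
        std::vector<uint8_t> hostObjectCode;      // precompiled shader for that target
    };

    // A "packfile" is then just every (key, value) pair sharing a
    // (romHash, targetPlatform) prefix, fetched once on first load of the game.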
tgtweak
The issue with "mining" the rom for shaders is they're not defined in a consistent way across the games. Games using the same base engine could have their shaders stored and referencable in a consistent way - but that would not be a catch-all solution for all games, which means you still need a fallback. Most of the ROMs are obfuscated and it's not possible, at least not currently, to "know" all the shaders being used in the game before the hardware calls to load the shader come through the game binary while it's running.
The best experience so far is downloading an additional shader cache alongside the ROM - in some ROM formats these can also be bundled with the ROM, acting like a dictionary the emulator can load directly vs having to add it as a "mod" for that specific game. Adding this to a DHT-type network for "shader sharing" would be great but might open the door to some abuse (shaders run at the hardware level, and there are some examples of malicious shaders out there) - plus you'd be exposing the games you're playing to the DHT network.
Anyway - Just a succinct example of the level of effort that goes into making an emulator "just work".
dcrazy
It was, in fact, a problem. DX11 and earlier tried to solve it with DXBC, an intermediate bytecode format that all drivers could consume. The driver would only need to lower the bytecode to the GPU’s ISA, which is much faster than a full compilation from HLSL. Prior to the emergence of Vulkan, OpenGL didn’t ever try to solve this; GLSL was always the interface to the driver. (Nowadays SPIR-V support is available to OpenGL apps if your driver implements the GL_ARB_gl_spirv extension, and of course DXIL has replaced DXBC.)
Compilation stutter was perhaps less noticeable in the DX9/OpenGL 3 era because shaders were less capable, and games relied more on fixed functionality which was implemented directly in the driver. Nowadays, a lot of the legacy API surface is actually implemented by dynamically written and compiled shaders, so you can get shader compilation hitches even when you aren't using shaders at all.
In the N64 era of consoles, games would write ISA (“microcode”) directly into the GPU’s shared memory, usually via a library. In Nintendo’s case, SGI provided two families of libraries called “Fast3D” and “Turbo3D”. You’d call functions to build a “display list”, which was just a buffer full of instructions that did the math you wanted the GPU to do.
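For flavor, building a display list with libultra's gbi.h macros looked roughly like this (written from memory, so details may be off):

    // Rough libultra-style display list construction (from memory; may
    // differ in detail from the real gbi.h macros).
    #include <ultra64.h>

    Vtx triangle_verts[3];  /* vertex data, filled in elsewhere */
    Gfx glist[8];           /* the display list: a buffer of GPU instructions */

    void build_display_list(void) {
        Gfx *g = glist;
        gSPVertex(g++, triangle_verts, 3, 0);  /* load 3 vertices into the RSP */
        gSP1Triangle(g++, 0, 1, 2, 0);         /* draw one triangle from them */
        gDPFullSync(g++);                      /* tell the RDP the frame is done */
        gSPEndDisplayList(g++);                /* terminate the list */
    }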
kimixa
Having worked on some of the latter part of that era of GPUs: the "frontend" of the shader compiler was a pretty small fraction of the total time cost; most of it was in the later optimization passes, which were often extremely hardware-specific (so not really possible at the level of DXBC). Especially as hardware started to move away from the assumptions used in designing it.
I think a big part of the user-visible difference in stutter is simply the expected complexity of shaders and number of different shaders in an "average" scene - they're 100s of times larger, and CPUs aren't 100s of times faster (and many of the optimization algorithms used are more-than-linear in terms of time vs the input too)
Modern DXIL and SPIR-V are at a similar level of abstraction to DXBC, and certainly don't "solve" stutter.
thereddaikon
Older games used precompiled shaders. These are inaccessible to the game devs and usually handled by the hardware makers, so the platform OEM for consoles and the video card OEM on PCs. Game devs have begged for the ability to write their own shaders for years and finally got it with DX11 and Vulkan. And that's when things went to hell. Instead of the shaders being written and compiled for the specific hardware, they now have to be compiled for your GPU at run time. It's a messy and imperfect process. EA, Ubisoft or anyone else is never going to have the same level of understanding of a GPU that Nvidia or AMD will have. Often the stuttering is due to the shaders having to be recompiled in game, something that never happened before.
maximilianburke
Compiling shaders directly from a high level representation to the GPU ISA only really happens on consoles.
In DirectX on PC, shaders have been compiled into an intermediate form going back to Direct3D 8. All of these intermediate forms are lowered into an ISA-specific instruction set by the drivers.
This final compilation step is triggered lazily when a draw happens, so if you are working on a "modern" engine that uses thousands of different material types your choices to handle this are to a) endure a hiccup as these shaders are compiled the first time they are used, b) force compilation at a load stage (usually by doing like a 1x1 pixel draw), or c) restructure the shader infrastructure by going to a megashader or similar.
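Option (b) in practice is just issuing a throwaway draw with each shader combination during a load screen; a D3D11-flavored sketch (struct and function names invented for the example):

    // Sketch of shader "prewarming" via tiny throwaway draws (D3D11-flavored).
    #include <d3d11.h>
    #include <vector>

    struct MaterialShaders {
        ID3D11VertexShader* vs;
        ID3D11PixelShader*  ps;
    };

    // Bind each shader pair and draw a degenerate triangle so the driver's
    // lazy bytecode->ISA compilation happens now, not mid-gameplay.
    void PrewarmShaders(ID3D11DeviceContext* ctx,
                        const std::vector<MaterialShaders>& materials) {
        D3D11_VIEWPORT vp = {};
        vp.Width  = 1.0f;  // 1x1 viewport: the cost is the compile, not the draw
        vp.Height = 1.0f;
        ctx->RSSetViewports(1, &vp);
        for (const auto& m : materials) {
            ctx->VSSetShader(m.vs, nullptr, 0);
            ctx->PSSetShader(m.ps, nullptr, 0);
            ctx->Draw(3, 0);  // triggers final compilation for this combination
        }
    }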
dcrazy
> Compiling shaders directly from a high level representation to the GPU ISA only really happens on consoles.
When targeting Apple platforms, you can use the metal-tt tool to precompile your shaders to ISA. You give it a list of target triples and a JSON file that describes your PSO. metal-tt comes with Xcode, and is also available for Windows as part of the Game Porting Toolkit.
Unfortunately, most people don’t do that. They’re spoiled by the Steam monoculture, in which Steam harvests the compiled ISA from gamers’ machines and makes it available on Valve’s CDN.
powerhugs
> Compiling shaders directly from a high level representation to the GPU ISA only really happens on consoles.
No, that's not correct. In fact, it's mostly the other way around. Consoles have known hardware, and thus games can ship with precompiled shaders. I know this has been done since at least the PS2 era, since I enjoy taking apart game assets.
While on PC, you can't know what GPU is in the consumer device.
For example, Steam has this whole concept of precompiled shader downloads in order to mitigate the effect for the end user.
garaetjjte
It was, but it was made worse by DX12/Vulkan. Previously, shader stages were separate, and fixed-pipeline settings were a loose bag of knobs. If that corresponded to hardware stages, great, everybody was happy! However, if there was a mismatch between API and hardware (for example, some fixed pipeline state wasn't really hardware state but was lowered to shader code), then the driver needed to hack around it. If that could be done by patching the shader, it was ugly, but it worked fine. But if it required recompilation, then stutters occurred, and the application couldn't prevent them because it was all hidden from it.

The solution was thought to be encapsulating everything in pipeline state objects. If you could do that, excellent: no stutters, no matter what hardware it runs on. It has issues, however: in reality, applications (especially emulators) don't always know beforehand which states they will need; the previous ugly driver tricks that could sometimes avoid recompilation by patching shaders are no longer possible; and even if the hardware does have easily switchable state, the API doesn't expose that, leading to a combinatorial explosion of state objects that was previously unnecessary. Some of that was rolled back by introducing API extensions that allow more granular switches of partial state.
edflsafoiewq
The oldest N64 emulators predate programmable shaders. They mapped possible configurations of the N64 GPU onto configurations of the host GPU. But this is really hard and some configurations are just impossible. It's basically a huge list of tricks and special cases. For reference, I think it was like 16K lines for the color combiner, which basically does (A-B)*C+D.
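That per-channel equation itself is tiny; the explosion comes from what A, B, C, and D can each be wired to:

    // Per-channel N64 color combiner: out = (A - B) * C + D.
    // Each of A, B, C, D is selected from many possible sources (texel color,
    // shade color, primitive color, environment color, ...), and mapping every
    // legal selection onto host GPU state is where the thousands of special
    // cases in old emulators came from.
    static inline float Combine(float a, float b, float c, float d) {
        return (a - b) * c + d;
    }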
CrossVR
To answer that question in the context of Vulkan I highly recommend reading the proposal document for VK_EXT_shader_object which contains a great explanation in the problem statement about how modern graphics APIs ended up in this situation:
https://github.com/KhronosGroup/Vulkan-Docs/blob/main/propos...
The gist of it is that graphics APIs like DX11 were designed around pipelines being compiled in pieces, each piece representing a different stage of the pipeline. These pieces are then linked together at runtime just before the draw call. However, the pieces are rarely a perfect fit, requiring the driver to patch them or do further compilation, which can introduce stuttering.
In an attempt to further reduce stuttering, and to reduce complexity for the driver, Vulkan did away with these piecemeal pipelines and opted for monolithic pipeline objects. This allowed the application to pre-compile the full pipeline ahead of time, alleviating the driver from having to piece the pipeline together at the last moment.
If implemented correctly you can make a game with virtually no stuttering. DOOM (2016) is a good example where the number of pipeline variants was kept low so it could all be pre-compiled and its gameplay greatly benefits from the stutter-free experience.
This works great for a highly specialized engine with a manageable number of pipeline variants, but for more versatile game engines and for most emulators, pre-compiling all pipelines is untenable: the number of permutations between the different variations of each pipeline stage is simply too great. For these applications there was no other option than to compile the full pipeline on demand and cache the result, making the stutter worse than before, since there is no ability to do piecemeal compilation of the pipeline ahead of time.
This gets even worse for emulators that attempt to emulate systems where the pipeline is implemented in fixed-function hardware rather than programmable shaders. On those systems the games don't compile any piece of the pipeline; the game simply writes to a few registers to set the pipeline state right before the draw call. Even piecemeal compilation won't help much here, thus ubershaders were used instead to emulate a great number of hardware states in a single pipeline.
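To see why the permutations blow up: a monolithic pipeline is cached on every piece of state it bakes in, so variant counts multiply. An illustrative key (field choices invented for the example):

    // Illustrative monolithic-pipeline cache key: every piece of state that a
    // DX12/Vulkan pipeline bakes in becomes part of the key.
    #include <cstdint>

    struct PipelineKey {
        uint64_t vertexShaderHash;
        uint64_t fragmentShaderHash;
        uint8_t  blendMode;      // say 8 possibilities
        uint8_t  depthCompare;   // say 8 possibilities
        uint8_t  cullMode;       // 3 possibilities
        uint8_t  vertexFormat;   // say 16 possibilities
    };
    // Even with one shader pair, that's already 8 * 8 * 3 * 16 = 3072 potential
    // pipelines; an emulator of fixed-function hardware can hit most of them.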
Jasper_
It happened all the time. I've played plenty of D3D11 games where clicking to shoot my gun would stutter the first time it happened as it was compiling shaders for the bullet fire particle effects.
Driver caches mean that after everything gets "prewarmed", it won't happen again.
an-unknown
I think there is some confusion about "ubershaders" in the context of emulators in particular. Old Nintendo consoles like the N64 or the GameCube/Wii didn't have programmable shaders. Instead, it was a mostly fixed-function pipeline, but you could configure some stages of it to somewhat fake "programmable" shaders, at least to some degree. Now the problem is, you have no idea what any particular game is going to do, right until the moment it writes a specific configuration value into a specific GPU register, which instantly configures the GPU to do whatever the game wants from that very moment onwards. There literally is no "shader" stored in the ROM; it's just code configuring (parts of) the GPU directly.
That's not how any modern GPU works, though. Instead, you have to emulate this semi-fixed-function pipeline with shaders. Emulators try to generate shader code for the current GPU configuration and compile it, but that takes time and can only be done after the configuration has been observed for the first time. This is where "ubershaders" enter the scene: an ubershader is a single huge shader which implements the complete configurable semi-fixed-function pipeline, so you pass the configuration registers in to the shader and it acts accordingly. Unfortunately, such shaders are huge and slow, so you don't want to use them unless it's necessary. The idea is then to keep an ubershader ready as a fallback, use it whenever you see a new configuration, compile the real shader in the background and cache it, and switch to the compiled shader once it's available, to improve performance again. A few years ago, the developers of the Dolphin emulator (GameCube/Wii) wrote an extensive blog post about how this works: https://de.dolphin-emu.org/blog/2017/07/30/ubershaders/
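In effect, the ubershader is an interpreter for the configuration registers that runs per pixel. A toy CPU-side model of the idea (real ubershaders do this on the GPU, and the source table below is a tiny invented subset):

    // Toy model of the ubershader idea: one "interpreter" evaluates the
    // config registers per pixel instead of using a specialized shader.
    #include <cstdint>

    struct Pixel { float r, g, b, a; };
    struct CombinerConfig { uint32_t srcA, srcB, srcC, srcD; };

    static float SelectSource(uint32_t sel, const Pixel& tex, const Pixel& shade) {
        switch (sel) {              // tiny subset of the real source table
            case 0:  return tex.r;   // texel color
            case 1:  return shade.r; // vertex/shade color
            case 2:  return 1.0f;    // constant one
            default: return 0.0f;    // constant zero
        }
    }

    // Branches on the config for every pixel: slow but always correct, and it
    // needs no compilation when a new configuration shows up.
    float UbershaderRed(const CombinerConfig& cfg, const Pixel& tex, const Pixel& shade) {
        float a = SelectSource(cfg.srcA, tex, shade);
        float b = SelectSource(cfg.srcB, tex, shade);
        float c = SelectSource(cfg.srcC, tex, shade);
        float d = SelectSource(cfg.srcD, tex, shade);
        return (a - b) * c + d;
    }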
Only with the 3DS/Wii U did Nintendo consoles finally get "real" programmable shaders, in which case you "just" have to translate them to whatever you need for your host system. You still won't know which shaders you'll see until you observe the transfer of the compiled shader code to the emulated GPU. After all, the shader code is compiled ahead of time to GPU instructions, usually during the build process of the game itself; at least for Nintendo consoles, there are SDK tools to do this. This, of course, means there is no compilation happening on the console itself, so there is no stutter caused by shader compilation either - unlike in an emulation of such a console, which has to translate and recompile such shaders on the fly.
> How come this was never a problem for older [...] emulators?
Older emulators had highly inaccurate and/or slow GPU emulation, so this was not really a problem for a long time. Only once GPU emulation became accurate enough, with dynamically generated shaders for high performance, did shader compilation stutter become a real problem.
dcrazy
> Old Nintendo consoles like the N64 or the GameCube/Wii didn't have programmable shaders.
The N64 did in fact have a fully programmable pipeline. [1] At boot, the game initialized the RSP (the N64’s GPU) with “microcode”, which was a program that implemented the RSP’s graphics pipeline. During gameplay, the game uploaded “display lists” of opcodes to the GPU which the microcode interpreted. (I misspoke earlier by referring to these opcodes as “microcode”.) For most of the console’s lifespan, game developers chose between two families of vendor-authored microcode: Fast3D and Turbo3D. Toward the end, some developers (notably Factor5) wrote their own microcode.
Jasper_
Microcode was only used for the RSP, which was a modified MIPS coprocessor and could only realistically be used for T&L. After that, the RSP then sends triangles to the RDP for rasterization, pixel pipeline, and blending, all of which are fixed-function, admittedly with some somewhat flexible color combiner stuff.
whywhywhywhy
[flagged]
aruametello
It's a rather old sample, but from the same author and project (somewhat):
https://www.youtube.com/watch?v=Yl7WvabHfk4
the OG "less compressed" video was on a twitter post that i couldn't find right now.
igorguerrero
> Work in Progress
> Emulator Support (Plugin) and Ray Tracing (RT) are not available in this repository yet.
Here's a 2022 video that shows it in action
https://www.reddit.com/r/Games/comments/v42611/dario_on_twit...