The Prospero Challenge

41 comments

·March 24, 2025

Visit

kevmo314

Nice challenge, I got it down to 0.5ms/frame. https://github.com/kevmo314/prospero.vm

jstanley

So 2000 fps, 16 million pixels, and 7000 operations per pixel, works out to 224 TFLOPS.

An RTX 4090 is advertised as being able to compute 82 TFLOPS.

Where do you think the extra is coming from? Is it just straightforward optimisations like constant folding? Or do you think the compiler is noticing that 1000 iterations of the loop doesn't change the answer and optimising it down to just 1 loop?

Jtsummers

There are 7866 instructions including the constants. There are 1406 constants leaving only 6460 real instructions to be executed (max, min, add, neg, sub, square, sqrt). Those constants can be directly encoded into most (possibly all) of the instructions when mapped to real machine instructions, which a compiler would likely do unless there was a good reason to keep it in a register or memory location.

Something I saw from a cursory scan were some near duplicate instruction (haven't written code to find all instances):

  a = x - c
  ;; some time later
  b = c - x

Recognizing that b is the negation of a, you can convert the calculation of b to:

  b = -a

This may or may not be faster, but it does mean that we can possibly forget about c earlier (helpful for register allocation and can reduce memory accesses).

Negations can sometimes be removed depending on the lifetime of the variable and its uses. Taking the above, suppose we follow it with this use of b:

  d = e + b ;; or b + e

We can rewrite this as:

  d = e - a

Or if it had been:

  d = e - b

It can be rewritten to:

  d = e + a

And if there are no more uses of b we've eliminated both that variable and another instruction from the program. These and other patterns are things a compiler would detect and optimize.

Though looking at the uses of the results from neg, I think most are used in the sequence of max/min instructions following them so it may not be possible to eliminate them as I showed above.

ashdnazg

Pretty sure it's optimisations.

Even small things like converting multiplications followed by additions to FMA will reduce the operation count.

Add to that constant folding etc. and a speedup factor of ~3 is not so hard to imagine.

phdelightful

I compiled it for Ampere and counted 6834 actual F32 operations in the SASS after optimizations. I only counted FFMA, FADD, FMUL, FMNMX, and MUFU.RSQ after eyeballing the SASS code, so there might even be more. It's possible the FMNMX doesn't actually take a FLOP since you can do f32 max as an integer operation, and perhaps MUFU.RSQ doesn't either, but even if you only count FFMA, FADD, and FMUL there are still 3685 ops.

  nvcc -arch=sm_86 prospero.cu -o prospero
  cuobjdump -sass prospero | grep -E 'FFMA|FADD|FMUL|FMNMX|MUFU\.RSQ' | wc -l

kevmo314

It’s a good question, admittedly I don’t know. I looked into it when Matt mentioned it but I’m not so familiar with what happens after ptx to say.

If someone does know I’d love to learn why though.

null

[deleted]

montag

Impressive. Did you do all this in the past 2 hours?

kevmo314

Yeah, it took me an hour-ish to get the performance and then an hour-ish to figure out how to actually render it correctly. I had never seen the P2/P5 PGM format before so I didn't realize P5 files were binary. :)

ayastrebov2

Great idea and clean implementation! You seem to be interested in performance engineering - I did some work recently, check this out https://github.com/AlexanderYastrebov/wireguard-vanity-key

adrian17

> (Spoilers: it also allocates 60+ GB of RAM for intermediate results, so consider reducing the image size before running it on your own machine)

The input algorithm is essentially written in SSA form, and it's easy to analyze when each "register" stops being used; then you can drop the arrays after their last use. Turns out the number of live registers never exceeds 160 at any point in time, and with this one addition the max memory usage drops to barely above 1GB.

Jtsummers

I don't know why you are downvoted (at the time of this writing). That's a very good point and makes the simple Python solution feasible on many more computers (due to reduced memory footprint). It's also a straightforward pass (once you know what liveness analysis is). Since there are no loops it's just one pass starting from the end and working backwards to determine whether a variable is live or not, and then an extra check on each iteration to decide whether to delete particular arrays.

tekknolagi

I think getting downvoted (not by me) because if you scroll to the end of the post, that's the very first submission (by me).

Jtsummers

Ah, fair. I didn't catch that.

edflsafoiewq

Converted it to GLSL that can be dropped into shadertoy.com/new. I get about 15FPS at 714x401.

https://pastebin.com/jBwNrfim

stevage

> If you evaluate the expression with (x,y) values in the ±1 square, then color pixels black or white based on their sign

I do not understand what this means at all, particularly the bit before the comma.

Can someone explain it differently?

cjs_ac

Yeah, this took me a while. Basically, the entire file represents a sort of assembly language for a function. You need to evaluate that entire function for each pixel in the output image.

Each line starts with a line number (with a leading underscore). This is followed by an instruction, and zero, one, or two arguments. The arguments can be either floating point constants or integers (with leading underscores) which represent the results of those corresponding lines earlier in the function.

The x and y coordinates of each pixel are loaded in the assembly language using the var-x and var-y instructions.

jstanley

Break down the range of x values from -1 to 1, and y values from -1 to 1, each into 1024 parts (or whatever resolution you want to plot at). Evaluate the function for each (x,y) coordinate. If the resulting number is negative colour the pixel white, otherwise colour it black.

The expression is (probably) a 2d signed-distance function - https://en.wikipedia.org/wiki/Signed_distance_function

stevage

Ohh.. "the +/- 1 square" means a square of size 2 centred around the origin? That was way too terse for me.

iamwil

For a given (x, y), you color the pixel black if you're outside of a shape, and color the pixel white if you're inside the shape.

The math formula is a way to describe the contour of the shape. To tell if you're inside or outside of a shape, you plug in your (x, y) coordinate into the math formula, and if the output is > 0, then you're outside. If the output is < 0, then you're inside.

I think they call it a signed distance field. You should watch his talk that explain it. It's really great.

https://www.youtube.com/watch?v=UxGxsGnbyJ4

RainyDayTmrw

Can anyone explain where this blob of "assembly language" comes from? In practice, the author presumably with the desired output image and backed into this esoteric format, but I'm not following how.

What is considered an acceptable preprocessing or transformation? In the limit case, a static executable that computes nothing and outputs a constant, literal image probably violates the intention of the question.

jstanley

> In practice, the author presumably with the desired output image and backed into this esoteric format, but I'm not following how.

I actually don't think this is how he did it. You might notice the font is kind of weird.

I think he started with an SDF font and created the "assembly language" from that. https://en.m.wikipedia.org/wiki/Signed_distance_function

mkeeter

> Can anyone explain where this blob of "assembly language" comes from?

Assembly language is definitely the right analogy: it's a low-level target generated by higher-level tools. In this case, the expression came from a Python script calling this text(...) function:

https://github.com/mkeeter/antimony/blob/f6a56dd7/py/fab/sha...

The font is hand-built from geometric primitives (rectangles, circles, etc) and CSG operations (union, intersection, difference)

> What is considered an acceptable preprocessing or transformation?

I'm looking for interesting ideas, and to mine the depths of PLs / compiler / interpreter / runtime research. Just returning a fixed image isn't particularly interesting, but (for example) I just updated the site with a compile-to-CUDA example that shows off the brute force power of a modern GPU.

mattdesl

What are the practical implications of this kind of assembly language? Surely there’s more efficient means of describing 2D SDFs?

Fun exercise! I’ve been enjoying trying to find some new ways to approach the challenge. I managed to build a single string expression for the entire program, so it could be evaluated per-pixel in a shader, but it turns out the expression is too complex for WebGL & WebGPU and the shader fails to compile.

My next thought would be to evaluate the program at a low resolution to create a low res SDF texture for the shader to draw at a higher resolution. Some information will probably be lost, though.

mkeeter

> What are the practical implications of this kind of assembly language? Surely there’s more efficient means of describing 2D SDFs?

By analogy, you wouldn't program in LLVM IR, but it's a useful intermediate representation for a bunch of higher-level languages. Higher-level tools can target this representation, and then they all get to use a standard set of optimizations and algorithms (fast evaluation, rendering, etc).

(I gave a recent talk that's a broad overview of this research: https://www.youtube.com/watch?v=UxGxsGnbyJ4)

> My next thought would be to evaluate the program at a low resolution to create a low res SDF texture for the shader to draw at a higher resolution.

Glad you're enjoying the challenge! You may also be interested in

https://www.redblobgames.com/x/2403-distance-field-fonts/

cfbolztereick

This is really fun, I'm getting badly nerd sniped by this.

Titan2189

It's been 4 hours already. How are you doing? Didn't forget to eat and drink? Just checking in on you :)

6stringmerc

Whoa cool invocation of the magician Prospero! I recently adapted “The Tempest” as a screenplay while in jail and typed it up:

https://samhenrycliff.medium.com/adapting-shakespeares-the-t...

yoz

Thanks so much for writing this up and sharing. It sounds like you've been through a horrific few years, but your perseverance in creativity is inspiring and fascinating. I wish you well in your efforts to make a better life for yourself, and I hope your work finds a wider audience.

neverokay

What compelled you in jail? I’m asking if there was more to it than just being bored.

yolhan83

RTX 4060, done in julia 20 fps (for 4096X4096) (47ms per frame) in https://github.com/yolhan83/ProsperoChal It was quite fun !

IshKebab

Hash the input, if it matches the expected input output the expected image. Otherwise error.

Technically correct.

ainiriand

This is why we can't have nice things.

mkeeter

Quoth the article: "Obviously, you could precompute the entire image, but that's against the spirit of the challenge"

crazygringo

Why waste time hashing?

Just output the expected image in all cases.

Still technically correct. ;)

James_K

Some of these solutions seem to copy the .vm file and rewrite it in another programming language. Is this permitted? Seems like cheating to not count the time spent compiling and transforming that code in the runtime of the program.

zomglings

Can you explain how you produce the opcodes from a text/image?

edflsafoiewq

Looks SDF-ish. My guess is convert the letters to curves, cut them into pieces without holes so each piece is the set of points between certain curves, write down SDFs for each curve (negative on the "inside", positive on the "outside"), combine all SDFs in a piece with intersection (max), then combine all pieces with union (min).

iamwil

the text contains a math expression. The math expression can be interpreted as an abstract syntax tree. The AST can also be translated into bytecode and interpreted by a stack-based virtual machine.

The op-codes are where you do that translation from an AST into bytecode. The entire 2nd half of the crafting interpreters book will guide you through how it's done. https://craftinginterpreters.com/a-bytecode-virtual-machine....

jstanley

I'm thinking that we can parse the input into an expression tree, and then for each x coordinate, have each node in the tree tell us which intervals of y coordinates it would return a value < 0 (or <= 0 without loss of generality). Since we're basically dealing with real numbers we can pretend true 0 doesn't exist.

Composing these intervals is trivial for all operations except addition/subtraction. (For example, `mul` is less than 0 if exactly one of its arguments is less than 0, `neg` is less than 0 if its argument is not less than 0, `min` less than 0 if either of its arguments is less than 0, etc.).

And then we define add(a, b) as sub(a, neg(b)). So then we only need to work out which y values sub(a, b) will return a negative result for.

sub(a,b) is negative when a<b.

We can have the sub() node asks its 2 child nodes which values they would return a negative result for. Every y coordinate where a gives a negative result and b gives a positive result is a y coordinate where sub(a,b) is negative. Every y coordinate where a is positive and b is negative is positive. For the remaining y values I think we have to actually evaluate the children and find out which ones give a negative result.

Obviously memoise the evaluation so that we don't repeat any work, and turn the expression tree into a DAG by combining any identical nodes. Some subtrees won't contain var-x, so the memoisation of those nodes can persist across different x coordinates.

And then for each x coordinate we ask the expression tree to tell us which y coordinates give negative outputs, and plot those. It's possible that the idea would generalise to 2-dimensional intervals, not sure.

I haven't implemented this yet but I'm planning to try it out, but to be honest I don't expect it to be faster than Matt's GPU version based on recursive subdivision with interval arithmetic. Btw his talk is great https://www.youtube.com/watch?v=UxGxsGnbyJ4&t=80s

And, a second fun challenge would be to reverse engineer the input file to extract the font! I expect the file is basically a union of x/y translations of functions that plot individual letters, so it shouldn't be crazy hard to split out the top-level union, then split-out the x/y translations, and then collect the letters. It's possible that it's more complicated though. In particular, is each character translated in x/y individually, or is each character translated only in x to form each line of text, and then the line as a whole is translated in y? Or something weirder?

HN

The Prospero Challenge

The Prospero Challenge