
I want a good parallel computer

202 comments · March 21, 2025

deviantbit

"I believe there are two main things holding it back."

He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.

I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.

What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.

I will take how things are today over how things used to be in a heartbeat. I really believe I need to spend two weeks requiring students to write code on an Amiga, and the programs have to run at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.

ryukoposting

One of the most important steps of my career was being forced to write code for an 8051 microcontroller. Then writing firmware for an ARM microcontroller to make it pretend it was that same 8051 microcontroller.

I was made to witness the horrors of archaic computer architecture in such depth that I could reproduce them on totally unrelated hardware.

deviantbit

I tell students today that the best way to learn is by studying the mistakes others have already made. Dismissing the solutions they found isn’t being independent or smart; it’s arrogance that sets you up to repeat the same failures.

Sounds like you had a good mentor. Buy them lunch one day.

znpy

I had a similar experience. Our professor in high school would have us program a Z80 system entirely by hand: flow chart, assembly code, computing jump offsets by hand, writing the hex code by hand (looking up opcodes from the Z80 data sheet), and then loading the opcodes one byte at a time on a hex keypad.

It took three hours and four of us to code an integer division start to finish (we were like 17, though).

The amount of understanding it gave has been unrivalled so far.

Diggsey

> I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with.

So the designers of the Cell processor made some mistakes and therefore the entire concept is bunk? Because you've seen a concept done badly, you can't imagine it done well?

To be clear, I'm not criticising those designers, they probably did a great job with what they had, but technology has moved on a long way from then... The theoretical foundations for memory models, etc. are much more advanced. We've figured out how to design languages to be memory safe without significantly compromising on performance or usability. We have decades of tooling for running and debugging programs on GPUs and we've figured out how to securely isolate "users" of the same GPU from each other. Programmers are as abstracted from the hardware as they've ever been with emulation of different architectures so fast that it's practical on most consumer hardware.

None of the things you mentioned are inherently at odds with more parallel computation. Whether something is a good idea can change. At one point in time electric cars were a bad idea. Decades of incremental improvements to battery and motor technology means they're now pretty practical. At one point landing and reusing a rocket was a bad idea. Then we had improvements to materials science, control systems, etc. that collectively changed the equation. You can't just apply the same old equation and come to the same conclusion.

hulitu

> and we've figured out how to securely isolate "users" of the same GPU from each other

That's the problem, isn't it.

I don't want my programs to act independently; they need to exchange data with each other (copy-paste, drag and drop). Also, I cannot do many things in parallel. Some things must be done sequentially.

deviantbit

[flagged]

aleph_minus_one

> What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on.

Isn't it much more plausible that the people who love to play with exotic (or retro), complicated architectures (which in this case offer high-performance opportunities) are different people from those who love to "set up or work in an assembly line for shipping stable software"?

> I really believe I need to spend 2-weeks requiring students write code on an Amiga, and the programs have to run at the same time. If anyone of them crashes, they all will fail my course. A new found appreciation may flourish.

I rather believe that among those who love this kind of programming, a hatred of the incompetent fellow students would develop (including wishes that they be weeded out by brutal exams).

wmf

The problem is that the exotic complexity enthusiasts cluster in places like HN and sometimes they overwhelm the voices of reason.

0xbadcafebee

> There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.

Nobody teaches it, and nobody writes books about it (not that anyone reads anymore)

deviantbit

So, there are books out there. I use Computer Architecture: A Quantitative Approach by Hennessy and Patterson. Recent revisions have removed historical information; I understand why they removed it. I wanted to use Stallings' book, but the department had already made arrangements with the publisher.

The biggest reason we don't write books is that people don't buy them. They take the PDF and stick it on GitHub. Publishers don't respond to authors' takedown requests, GitHub doesn't care about authors, so why spend the time publishing a book? We can chase grant money instead. I'm fortunate enough to not have to chase grant money.

pca006132

While financial incentives are important to some, a lot of people write books to share their knowledge and give them out for free. I think more people are doing this now, and there are also open collaborative textbook projects.

And I personally think it is weird to write a book during your working hours and also get money from selling it.

mabster

I loved and really miss the Cell. It did take quite a bit of work to shuffle things in and out of the SPUs correctly (so yeah, writing code took much longer and required greater care), but it really churned through data.

We had a generic job mechanism with the same restrictions on all platforms. This usually meant that if it ran at all on Cell it would run great on PC, because the data would generally be cache-friendly. But it was tough getting the PowerPC to perform.

I understand why the PS4 was basically a PC after that - because it's easier. But I wish there were still SPUs off to the side to take advantage of. I'd be happy to have them off-die like GPUs are.

sitkack

Those students would all drop out and start meditating. That would be a fun course. Speed run developing for all the prickly architectures of the 80s and 90s.

deviantbit

I see what you did there.

Keyframe

Guru meditation, for the uninitiated.

Yoric

Don't worry, with LLMs, we're moving away from anything that remotely looks like "stable software" :)

Also, yeah, I recall the dreaded days of cooperative multitasking between apps. Moving from Windows 3.x to Linux was a revelation.

hulitu

With LLMs it is just more visible. When the age of "updates" began, the age of stable software died.

Yoric

True. The quality of code yielded by LLMs would have been deemed entirely unacceptable 30 years ago.

nicoburns

> They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences.

Is there any reason why GPU-style parallelism couldn't have memory protection?

monocasa

It does. GPUs have full MMUs.

PicassoCTs

They do? Then how do I do the forbidden stuff by accessing neighboring pixel data?

grg0

The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.

- You need to compile shader source/bytecode at runtime; you can't just "run" a program.

- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.

- You need to synchronize data access between CPU-GPU and GPU workloads.

- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.

- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.

What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.

newpavlov

>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.

For "embarrassingly parallel" jobs vector extensions start to eat tiny bits of the GPU pie.

Unfortunately, just slapping thousands of cores works poorly in practice. You quickly get into the synchronization wall caused by unified memory. GPUs cleverly work around this issue by using numerous tricks often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).

The future may be in more explicit NUMA, i.e. in a "network of cores". Such hardware would expose a lot of cores with their own private memory (explicit caches, if you will), and you would need to explicitly transact with the bigger global memory. But, unfortunately, programming for such hardware would be much harder (especially if code has to be universal enough to target different specs), so I don't have high hopes for such a paradigm to become massively popular.
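
For a rough feel of that programming model, here is a toy sketch (my own illustration, in Python with multiprocessing; all names are made up): each worker stands in for a core with private memory, and data moves between the "global memory" and the workers only by explicit transfer.

    from multiprocessing import Pool

    TILE = 10_000  # how much fits in one core's "private memory"

    def process_tile(tile):
        # The worker only ever sees the tile that was explicitly shipped to it;
        # it computes on its private copy and explicitly ships a result back.
        return sum(x * x for x in tile)

    if __name__ == "__main__":
        global_data = list(range(1_000_000))  # the bigger "global memory"
        tiles = (global_data[i:i + TILE] for i in range(0, len(global_data), TILE))
        with Pool(processes=8) as pool:
            partial_sums = pool.map(process_tile, tiles)  # explicit transactions
        print(sum(partial_sums))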

touisteur

Seems to me there's a trend of applying explicit distributed systems (networks of small-SRAM cores, each with some SIMD, explicit high-bandwidth message-passing between them, maybe some specialized ASICs such as tensor cores, FFT blocks...), looking at Tenstorrent, Cerebras, even Kalray... Outside the CUDA/GPU world, accelerators seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.

znpy

It’s weird that no one mentioned xeon phi cards… that’s essentially what they were. Up to 188 (iirc?) x86 atom cores, fully generically programmable.

raphlinus

I consider Xeon Phi to be the shipping version of Larrabee. I've updated the post to mention it.

throwawaynin

Networks of cores ... Congrats you have just taken a computer and shrunk it so there are many on a single chip ... Just gonna say here AWS does exactly this network of computers thing ... Might be profitable

Grosvenor

What I want is a Linear Algebra interface, as Gilbert Strang taught it. I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.

I'm not willing to even know about the HW at all; the higher-level my code, the more opportunities for the JIT to optimize it.

What I really want is something like Mathematica that can JIT to GPU.

As another commenter mentioned, all the APIs assume you're a discrete GPU at the end of a slow bus, without shared memory. I would kill for an APU that could freely allocate memory for GPU or CPU and change ownership with the speed of a page fault or kernel transition.

RossBencina

> What I really want is something like Mathematica that can JIT to GPU.

https://juliagpu.org/

vgatherps

creata

To expand on this link, this is probably the closest you're going to get to 'I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.' right now. JAX implements a good portion of the Numpy interface - which is the most common interface for linear algebra-heavy code in Python - so you can often just write Numpy code, but with `jax.numpy` instead of `numpy`, then wrap it in a `jax.jit` to have it run on the GPU.
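
A minimal sketch of that recipe (my own toy example; the function `step` and the shapes are made up): ordinary NumPy-style array code, with the import swapped to `jax.numpy` and the function wrapped in `jax.jit`, so XLA compiles it for whatever backend is available (GPU/TPU if present, otherwise CPU).

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(w, x, b):
        # Plain linear-algebra code: no explicit kernels, threads, or copies.
        return jnp.tanh(x @ w + b)

    key = jax.random.PRNGKey(0)
    w = jax.random.normal(key, (1024, 1024))
    x = jax.random.normal(key, (256, 1024))
    b = jnp.zeros(1024)

    y = step(w, x, b)  # first call triggers compilation for the available backend
    print(y.shape, y.dtype)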

imtringued

I was about to say that it is literally just Jax.

It genuinely deserves to exist alongside pytorch. It's not just Google's latest framework that you're forced to use to target TPUs.

tipsytoad

Like, PyTorch? And the new Mac minis have 512gb of unified memory

amelius

> The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths.

To me it feels somewhat like programming for the segmented memory model with its near and far pointers, back in the old days. What a nightmare.

smallmancontrov

You can have that today. Just go out and buy more CPUs until they have enough cores to equal the number of SMs in your GPU (or memory bandwidth, or whatever). The problem is that the overhead of being general purpose -- prefetch, speculative execution, permissions, complex shared cache hierarchies, etc -- comes at a cost. I wish it was free, too. Everyone does. But it just isn't. If you have a workload that can jettison or amortize these costs due to being embarrassingly parallel, the winning strategy is to do so, and those workloads are common enough that we have hardware for column A and hardware for column B.

tliltocatl

Larrabee was something like that; it didn't take off.

IMHO, the real issue is cache coherence. GPUs are spared from doing a lot of extra work by relaxing coherence guarantees quite a bit.

Regarding the vendor situation - that's basically how most of computing hardware is, save for the PC platform. And this exception is due to Microsoft successfully commoditizing their complements (which caused quite some woe on the software side back then).

zozbot234

Is cache coherence a real issue, absent cache contention? AIUI, cache coherence protocols are sophisticated enough that they should readily adapt to workloads where the same physical memory locations are mostly not accessed concurrently except in pure "read only" mode. So even with a single global address space, it should be possible to make this work well enough if the programs are written as if they were running on separate memories.

monocasa

It is, because cache coherence requires extra communication to make sure that the caches are coherent. There are cute strategies for reducing the traffic, but ultimately you need to broadcast reservations to all of the other cache-coherent nodes, so there's an N^2 scaling at play.

fulafel

In the field usually nothing takes off on the first attempt, so this is just a reason to ask "what's different this time" on the following attempts.

actionfromafar

I miss, not exactly Larrabee, but what it could have become. I want just an insane number of very fast, very small cores with their own local memory.

jpc0

I really would like you to sketch out the DX you are expecting here, purely for my understanding of what it is you are looking for.

I find needing to write separate code in a different language annoying, but the UX of it is very explicit about what is happening in memory, which is very useful. With really high-performance compute across multiple cores, ensuring you don't get arbitrary cache misses is a pain. If we could address CPUs like we address current GPUs (well, you can, but it's not generally done) it would make things much, much simpler.

Want to alter something in parallel? Copy it to memory allocated to a specific core which is guaranteed to only be addressed by that core, and then do the operations on it.

To do that currently you need to be pedantic about alignment and manually indicate thread affinity to the scheduler, etc., which is entirely as annoying as GPU programming.

brzozowski

> What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory...

I too am very interested in this model. The Linux kernel supports up to 4,096 cores [1] on a single machine. In practice, you can rent a c7a.metal-48xl [2] instance on AWS EC2 with 192 vCPU cores. As for programming models, I personally find the Java Streams API [3] extremely versatile for many programming workloads. It effectively gives a linear speedup on serial streams for free (with some caveats). If you need something more sophisticated, you can look into OpenMP [4], an API for shared-memory parallelization.

I agree it is time for some new ideas in this space.

[1]: https://www.phoronix.com/news/Perf-Support-2048-To-4096-Core...

[2]: https://aws.amazon.com/ec2/instance-types/c7a/

[3]: https://docs.oracle.com/en/java/javase/24/docs/api/java.base...

[4]: https://docs.alliancecan.ca/wiki/OpenMP

fulafel

Yep, and those printers are proprietary and mutually incompatible, and there are buggy mutually incompatible serial drivers on all the platforms which results in unique code paths and debugging & workarounds for app breaking bugs for each (platform, printer brand, printer model year) tuple combo.

(That was idealized - actually there may be ~5 alternative driver APIs even on a single platform each with its own strengths)

IshKebab

Having worked for a company that made a "hundreds of small CPUs on a single chip" product, I can tell you now that they're all going to fail, because the programming model is too weird and nobody will write software for them.

Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.

mikewarot

The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUs. It could be executed very slowly using just a few logic gates and enough RAM to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
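
As a toy illustration of that kind of intermediate representation (my own sketch, in Python): a 1-bit full adder lowered to a DAG of Boolean operations and evaluated by walking the graph. In principle, each node could be handed to a LUT or a tiny CPU and evaluated in parallel every clock.

    from functools import reduce

    # Each node: (op, list of input names). "a", "b", "cin" are primary inputs.
    GRAPH = {
        "axb":  ("xor", ["a", "b"]),
        "sum":  ("xor", ["axb", "cin"]),
        "ab":   ("and", ["a", "b"]),
        "c_ax": ("and", ["axb", "cin"]),
        "cout": ("or",  ["ab", "c_ax"]),
    }

    OPS = {"xor": lambda x, y: x ^ y,
           "and": lambda x, y: x & y,
           "or":  lambda x, y: x | y}

    def evaluate(graph, inputs):
        values = dict(inputs)
        def value(name):
            if name not in values:  # resolve dependencies recursively (the graph is acyclic)
                op, ins = graph[name]
                values[name] = reduce(OPS[op], (value(i) for i in ins))
            return values[name]
        return {name: value(name) for name in graph}

    print(evaluate(GRAPH, {"a": 1, "b": 1, "cin": 0}))  # sum = 0, cout = 1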

shae

This sounds like graph reduction as done by https://haflang.github.io/ and that flavor of special purpose CPU.

The downside of reducing a large graph is the need for high bandwidth low latency memory.

The upside is that tiny CPUs attached directly to the memory could do reduction (execution).

pabs3

Reminds me of Mill Computing's stuff.

https://millcomputing.com/

zozbot234

Mill Computing's proposed architecture is more like VLIW, with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution, than a "generic sea" of small CPUs. VLIW CPUs are far from "tiny" in a general sense.

worldsayshi

Like interaction nets?

KerrAvon

Isn't that the Connection Machine architecture?

mikewarot

Most practical parallel computing hardware has had queues to handle the mismatch in compute speed when various CPUs run different algorithms on parts of the data.

Eliminating the CPU-bound compute and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.

Imagine a sea of LUTs (look-up tables) that are all clocked and only connected locally to their neighbors. The programming for this, even as a virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.

convolvatron

The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was, however, much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.

Grosvenor

I would not complain about getting my own personal Connection Machine.

So long as Tamiko Thiel does the design.

pizza

there’s a differentiable version of this that compiles to C or CUDA: difflogic

deviantbit

Yep, transputers failed miserably. I wrote a ton of code for them. Everything had to be solved in a serial bus, which defeated the purpose of the transputer.

pdimitar

Quite fascinating. Did you write about your experiences in that area? Would love to read it!

deviantbit

Not in those terms, but an autobiography is coming, and bits and pieces are being explained. I expect about 10 people to buy the book, as all of the socialists will want it for free. I am negotiating with a publisher as we speak on the terms.

__d

Could you elaborate on the “serial bus” bit?

turtletontine

Could you elaborate on this? How does many-small-CPUs make for a weirder programming model than a GPU?

I'm no expert, but I've done my fair share of parallel HPC stuff using MPI, and a little bit of CUDA. And to me the GPU programming model is far, far "weirder" and harder to code for than the many-CPUs model. (Granted, I'm assuming you're describing a different regime?)

dist-epoch

In CUDA you don't really manage the individual compute units, you start a kernel, and the drivers take care of distributing that to the compute cores and managing the data flows between them.

When programming CPUs however you are controlling and managing the individual threads. Of course, there are libraries which can do that for you, but fundamentally it's a different model.

zozbot234

The GPU equivalent of a single CPU "hardware thread" is called a "warp" or a "wavefront". GPUs can run many warps/wavefronts per compute unit by switching between warps to hide memory access latency. A CPU core can do this with two hardware threads, using Hyperthreading/2-way SMT, and some CPUs have 4-way SMT, but GPUs push that quite a bit further.

adrian_b

What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent to the older OpenMP.

When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.

Another case is when all the threads run the same program, but on different data. This is equivalent to concurrent execution of a "for" loop, which is always possible when the iterations are independent.

The execution of such a set of threads that execute the same program has been named "parallel DO instruction" by Melvin E. Conway in 1963, "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran language extension in 1997, "parallel for" in the OpenMP C/C++ language extension in 1998, and "kernel execution" in CUDA, which has also introduced the superfluous acronym SIMT to describe it.

When a problem can be solved by a set of concurrent threads that run the same program, then it is much simpler to scale the parallelism to extremely high levels and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.

There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA, but which creates a program for a CPU, not for a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.

The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them", but nobody is able to tell exactly what is done by this extra hardware support and whether it really matters, because it is a part of the GPUs that has never been documented publicly by the GPU vendors.

From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless if the target is a CPU or a GPU.

Whenever you write a program equivalent to a "parallel for", which is the same as writing for CUDA, you do not manage individual threads, because what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, also on a CPU, not only on a GPU. A desktop CPU like the Ryzen 9 9950X has the same product of threads by SIMD lanes as a big integrated GPU (obviously, discrete GPUs can be many times bigger).
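
For a concrete CPU-side picture of that "parallel for" model, here is a small sketch (my own illustration using Numba's prange; the comment's examples are OpenMP and ispc): you write the loop body once, and the compiler/runtime distributes iterations over threads and SIMD lanes with no thread management in user code.

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True, fastmath=True)
    def saxpy(a, x, y, out):
        # Iterations are independent, so the compiler is free to split the range
        # across threads and vectorize the body; the source stays a plain loop.
        for i in prange(x.shape[0]):
            out[i] = a * x[i] + y[i]

    n = 1_000_000
    x = np.random.rand(n)
    y = np.random.rand(n)
    out = np.empty_like(x)
    saxpy(2.0, x, y, out)
    print(out[:3])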

IshKebab

I mean weird compared to what already exists.

petermcneeley

I'm guessing you just don't have the computational power to compete with a real GPU. It would be relatively easy for a top-end graphics programmer to write the front-end graphics API for your chip. I'm guessing that if they did this, you would just end up with a very poorly performing GPU.

bryanlarsen

While acknowledging that it's theoretically possible other approaches might succeed, it seems quite clear the author agrees with you.

convolvatron

My take from reading this is that it's more about programming abstractions than any particular hardware instantiation. The part of the Connection Machine that remains interesting is not building machines with CPUs with transistor counts in the hundreds running off a globally synchronous clock, but that there was a whole family of SIMD languages that let you do general-purpose programming in parallel. And that those languages were still relevant when the architecture changed to a MIMD machine with a bunch of vector units behind each CPU.

snovymgodym

Reminds me of Itanium

CyberDildonics

How is that at all like Itanium except for the superficial headline level where people say they are hard to program?

snovymgodym

Because the main feature that made Itanium hard to program for was its explicit instruction-level parallelism.

audiofish

Picochip?

armchairhacker

> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?

What other workloads would benefit from a GPU?

Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.

For example, GUIs have responded to user input with imperceptible latency for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today, I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.

In some cases, parallelizing a task intrinsically makes it slower, because the number of sequential operations required to handle coordination means there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads, because it would still use all the processors.

I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.

raphlinus

Possibly compilation and linking. That's very slow for big programs like Chromium. There's really interesting work on GPU compilers (co-dfns and Voetter's work).

Optimization problems like scheduling and circuit routing. Search in theorem proving (the classical parts like model checking, not just LLM).

There's still a lot that is slow and should be faster, or at the very least made to run using less power. GPUs are good at that for graphics, and I'd like to see those techniques applied more broadly.

return_to_monke

All of these things you mention are "thinking", meaning they require complex algorithms with a bunch of branches and edge cases.

The tasks that GPUs are good at right now - graphics, number crunching, etc. - are all very simple algorithms at the core (mostly elementary linear algebra), and the problems are, in most cases, embarrassingly parallel.

CPUs are not very good at branching either - see all the effort being put towards getting branch prediction right - but they are way better at it than GPUs. The main appeal of GPGPU programming is, in my opinion, that if you can get the CPU to efficiently divide the larger problem into a lot of small, simple subtasks, you can achieve faster speeds.

You mentioned compilers. For a related example, see all the work Daniel Lemire has been doing on SIMD parsing: the algorithms he (co)invented are all highly specialized to the language being parsed, and highly nontrivial. Branchless programming requires an entirely different mindset/intuition than "traditional" programming, and I wouldn't expect the average programmer to come up with such novel ideas.

A GPU is a specialized tool that is useful for a particular purpose, not a silver bullet to magically speed up your code. There is a reason that we are using it for its current purposes.

hulitu

> Possibly compilation and linking. That's very slow for big programs like Chromium.

So instead of fixing the problem (Chromium's bloat) we just throw more memory and computing power at it, hoping that the problem will go away.

Maybe we shall teach programmers to program. /s

wmf

A big one is video encoding. It seems like GPUs would be ideal for it but in practice limitations in either the hardware or programming model make it hard to efficiently run on GPU shader cores. (GPUs usually include separate fixed-function video engines but these aren't programmable to support future codecs.)

dist-epoch

Video encoding is done with fixed-function hardware for power efficiency. A new popular codec like H.26x appears every 5-10 years; there is no real need to support future ones.

nwallin

Video encoding is two domains. And there's surprisingly little overlap between them.

You have your real time video encoding. This is video conferencing, live television broadcasts. This is done fixed-function not just for power efficiency, but also latency.

The second domain is encoding at rest. This is youtube, netflix, blu-ray, etc. This is usually done in software on the CPU for compression ratio efficiency.

The problem with fixed-function video encoding is that the compression ratio is bad. You either have enormous data, or awful video quality, or both. The problem with software video encoding is that it's really slow. OP is asking why we can't/don't have the best of both worlds: why can't/don't we write a video encoder in OpenCL/CUDA/ROCm, so that we have the speed of the GPU's compute capability but the compression ratio of software?

null

[deleted]

morphle

I haven't yet read the full blog post, but so far my response is: you can have this good parallel computer. See my previous HN comments over the past months on building an M4 Mac mini supercomputer.

For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets, the IOMMU, and the page tables that prevent you from programming all processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and making your own abstract-syntax-tree-to-assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.

https://en.wikipedia.org/wiki/Roofline_model

dekhn

There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.

Simply porting existing successful codes from CPU to GPU can be a major undertaking, and if there aren't any experts who can write something that drives immediate sales, a project can die on the vine.

See for example the Cray MTA (https://en.wikipedia.org/wiki/Cray_MTA). When I was first asked to try this machine, it was pitched as "run a million threads, the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it to GPUs.

AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.

I've found the best strategy is to target my development at what high-end consumers will be buying in 2 years - this is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next gen of cards arrives ("Can it run Crysis?").

Animats

Interesting article.

Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.

In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.

Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust, incidentally, is an wrapper to extend that concept to cross-platform (Mac, Android, browsers, etc.)

While it seems you can write a general 3D renderer that works in a wide variety of situations, that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.

There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.

In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.

The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process. This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.

Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.

[1] https://github.com/linebender/vello/

01HNNWZ0MV43FF

2D rendering is harder in fact, because antialiased curves are harder than triangle soup.

It's an issue of code complexity, not fill rate

https://faultlore.com/blah/text-hates-you/

archagon

I think a dynamic, fully vector-based 2D interface with fluid zoom and transformations at 120Hz+ is going to need all the GPU help it can get. Take mapping as an example: even Google Maps routinely struggles with performance on a top-of-the-line iPhone.

hulitu

> even Google Maps routinely struggles with performance on a top-of-the-line iPhone.

It has to download the jpegs.

mattdesl

> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D. Now, 3D renderers, we need all the help we can get.

A ton of 2D applications could benefit from further GPU parallelization. Games, GUIs, blurs & effects, 2D animations, map apps, text and symbol rendering, data visualization...

Canvas2D in Chrome is already hardware accelerated, so most users get better performance and reduced load on main UI & CPU threads out of the box.

jms55

Fast light transport is an incredibly hard problem to solve.

Raytracing (in its many forms) is one solution. Precomputing lightmaps, probes, occluder volumes, or other forms of precomputed visibility are another.

In the end it comes down to a combination of target hardware, art direction and requirements, and technical skill available for each game.

There's not going to be one general purpose renderer you can plug into anything, _and_ expect it to be fast, because there's no general solution to light transport and geometry processing that fits everyone's requirements. Precomputation doesn't work for dynamic scenes, and for large games leads to issues with storage size and workflow slow downs across teams. No precomputation at all requires extremely modern hardware and cutting edge research, has stability issues, and despite all that is still very slow.

It's why game engines offer several different forms of lighting methods, each with as many downsides as they have upsides. Users are supposed to pick the one that best fits their game, and hope it's good enough. If it's not, you write something custom (if you have the skills for that, or can hire someone who can), or change your game to fit the technical constraints you have to live with.

> Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.

Some games may have their own acceleration structures. Some won't. Some will only have them on the GPU, not the CPU. Some will have an approximate structure used only for specialized tasks (culling, audio, lighting, physics, etc), and cannot be generalized to other tasks without becoming worse at their original task.

Fully generalized solutions will be slow but flexible, and fully specialized solutions will be fast but inflexible. Game design is all about making good tradeoffs.

Animats

The same argument could be made against Vulkan, or OpenGL, or even SQL databases. The whole NoSQL era was based on the concept that performance would be better with less generality in the database layer. Sometimes it helped. Sometimes trying to do database stuff with key/value stores made things worse.

I'm trying to find a reasonable medium. I have a hard scaling problem - big virtual world, dynamic content - and am trying to make that work well. If that works, many games with more structured content can use the same approach, even if it is overkill.

pjmlp

At Vulkanised 2025 someone mentioned it is a HAL for writing GPU drivers, and they have acknowledged it has gotten as messy as OpenGL, and there is now a plan in place to try to sort out the complexity mess.

amelius

> Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.

Depends on how complicated your artwork is.

Animats

There are only so many screen pixels.

amelius

You can have an unlimited number of polygons overlapping a pixel. For instance, if you zoom out a lot. Imagine you converted a layer map of a modern CPU design to svg, and tried to open it in Inkscape. Or a map of NYC. Wouldn't you think a bit of extra processing power would be welcomed?

hulitu

> Modern GPUs are overkill for 2D

That explains why modern GUIs are crap: because they are not able to draw a bloody rectangle and fill it with colour. /s

bee_rider

It is odd that he talks about Larrabee so much, but doesn't mention the Xeon Phis. (Or is it Xeons Phi?)

> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.

I’ve always been slightly annoyed by the concept of E cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E cores, give them their AVX-512 back, and give them higher throughput memory. Maybe try and pull the Phi trick of less OoO capabilities but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.

nullpoint420

I've always wondered if you could use iGPU compute cores with unified memory as "transparent" E-cores when needed.

Something like OpenCL/CUDA except it works with pthreads/goroutines and other (OS) kernel threading primitives, so code doesn't need to be recompiled for it. Ideally the OS scheduler would know how to split the work, similar to how E-core and P-core scheduling works today.

I don't do HPC professionally, so I assume I'm ignorant to why this isn't possible.

Retr0id

Isn't Xeon Phi just an instance of Larrabee?

adrian_b

It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.

The "Larrabee New Instructions" is an instruction set that has been designed before AVX and also its first hardware implementation has been introduced before AVX, in 2010 (AVX was launched in 2011, with Sandy Bridge).

Unfortunately while the hardware design of Sandy Bridge with the inferior AVX ISA has been done by the Intel A team, the hardware implementations of Larrabee have been done by some C or D teams, which were also not able to design new CPU cores for it, but they had to reuse some obsolete x86 cores, initially a Pentium core and later an Atom Silvermont core, to which the Larrabee instructions were grafted.

"Larrabee New Instructions" have been renamed to "Many Integrated Cores" ISA, then to AVX-512, while passing through 3 generations of chips, Knights Ferry, Knights Corner and Knights Landing. A fourth generation, Knights Mill, was only intended for machine learning/AI applications. The successor of Knights Landing has been Skylake Server, when the AVX-512 ISA has come to standard Xeons, marking the disappearance of Xeon Phi.

Already in 2013, Intel Haswell has added to AVX a few of the more important instructions that were included in the Larrabee New Instructions, but which were missing in AVX, e.g. fused multiply-add and gather instructions. The 3-address FMA format, which has caused problems to AMD, who had implemented in Bulldozer a 4-address format, has also come to AVX from Larrabee, replacing the initial 4-address specification.

At each generation until Skylake Server, some of the original Larrabee instructions were deleted, on the assumption that they might be needed only for graphics, which was no longer the intended market. However, a few of those instructions were really useful for some applications in which I am interested, e.g. computations with big numbers, so I regret their disappearance.

Since Skylake Server, there have been no other instruction removals, with the exception of those introduced by Intel Tiger Lake, which are now supported only by AMD Zen 5. A few days ago Intel committed to keeping complete compatibility in the future with the ISA implemented today by Granite Rapids, so there will be no other instruction deletions.

raphlinus

> It is an instance of Larrabee in the same sense as AMD Zen 4 is an instance of Larrabee.

This is an odd claim. Clearly Xeon Phi is the shipping version of Larrabee, while Zen 4 is a completely different chip design that happens to run AVX-512. The first shipping Xeon Phi (Knights Corner) used the exact same P54C cores as Larrabee, while as you point out later versions of Xeon Phi switched to Atom.

It is extremely common to refer to all these as Larrabee, for example the Ian Cutress article on the last Xeon Phi chip was entitled "The Larrabee Chapter Closes: Intel's Final Xeon Phi Processors Now in EOL" [1]. Pat Gelsinger's recent interview at GTC [2] also refers to Larrabee. The section from around 44:00 has a discussion of workloads becoming more dynamic, and at 53:36 there's a section on Larrabee proper.

[1]: https://www.anandtech.com/show/14305/intel-xeon-phi-knights-...

[2]: https://www.youtube.com/live/pgLdJq9FRBQ

null

[deleted]

ip26

> I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs … struggle when the workload is dynamic

This sacrifice is a purposeful cornerstone of what allows GPUs to be so high throughput in the first place.

Retr0id

Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified - which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.

jms55

You can. There are API extensions for persistently mapping memory, and it's up to you to ensure that you never write to a buffer at the same time the GPU is reading from it.

At least for Vulkan/DirectX12. Metal is often weird, I don't know what's available there.

deviantbit

Unified memory doesn't mean unified address space. It frustrates me when no one understands unified memory.

morphle

If you fix the page tables (partial tutorial online) you can have a contiguous unified address space on Apple Silicon.

deviantbit

Let’s be honest, saying “just fix the page tables” is like telling someone they can fly if they “just rewrite gravity.”

Yes, on Apple Silicon, the hardware supports shared physical memory, and with enough “convincing”, you can rig up a contiguous virtual address space for both the CPU and GPU. Apple’s unified memory architecture makes that possible, but Apple’s APIs and memory managers don’t expose this easily or safely for a reason. You’re messing with MMU-level mappings on a tightly integrated system that treats memory as a first-class citizen of the security model.

I can tell you never programmed on an Amiga.

svmhdvn

I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.

scroot

When this topic comes up, I always think of uFork [1]. They are even working on an FPGA prototype.

[1] https://ufork.org/