
Show HN: I built a hardware processor that runs Python


269 comments · April 28, 2025

Hi everyone, I built PyXL — a hardware processor that executes a custom assembly generated from Python programs, without using a traditional interpreter or virtual machine. It compiles Python -> CPython Bytecode -> Instruction set designed for direct hardware execution.

I'm sharing an early benchmark: a GPIO test where PyXL achieves a 480 ns round-trip toggle, compared to 14-25 microseconds on a MicroPython Pyboard, even though PyXL runs at a lower clock (100 MHz vs. 168 MHz).

The design is stack-based, fully pipelined, and preserves Python's dynamic typing without static type restrictions. I independently developed the full stack — toolchain (compiler, linker, codegen), and hardware — to validate the core idea. Full technical details will be presented at PyCon 2025.

Demo and explanation here: https://runpyxl.com/gpio

Happy to answer any questions.

zik

This is a very cool project but I feel like the claim is overstated: "PyXL is a custom hardware processor that executes Python directly — no interpreter, no JIT, and no tricks. It takes regular Python code and runs it in silicon."

Reading further down the page it says you have to compile the python code using CPython, then generate binary code for its custom ISA. That's neat, but it doesn't "execute python directly" - it runs compiled binaries just like any other CPU. You'd use the same process to compile for x86, for example. It certainly doesn't "take regular python code and run it in silicon" as claimed.

A more realistic claim would be "A processor with a custom architecture designed to support python".

goranmoomin

Not related to the project in any way, but I would say that if the hardware is running on CPython bytecode, I’d say that’s as far as it can get for executing Python directly – AFAIK running python code with the `python3` executable also compiles Python code into bytecode `*.pyc` files before it runs it. I don’t think anyone claims that CPython is not running Python code directly…
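(For reference, you can watch that compile step happen with standard CPython tooling; a minimal sketch, where hello.py is just a placeholder file name:)

    # CPython compiles source to bytecode before executing it; py_compile makes
    # the step explicit and writes the cached .pyc under __pycache__.
    import py_compile, dis

    pyc_path = py_compile.compile("hello.py")   # assumes a hello.py exists locally
    print(pyc_path)                             # e.g. __pycache__/hello.cpython-312.pyc

    # The same compilation can also be done in memory and inspected:
    code = compile("print('hi')", "<demo>", "exec")
    dis.dis(code)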

hamandcheese

I agree with you, if it ran pyc code directly I would be okay saying it "runs python".

However it doesn't seem like it does, the pyc still had to be further processed into machine code. So I also agree with the parent comment that this seems a bit misleading.

I could be convinced that the native code is sufficiently close to pyc that I don't feel misled. Would it be possible to write a boot loader which converts pyc to machine code at boot? If not, why not?

f1shy

Well, it really does not run CPython, but CPython bytecode, compiled down to assembly. Granted, a very specific, tailored assembly language, but still.

Anyway, the project is mega-cool, and very useful (in some specific applications). It's just that the title is a little bit confusing.

hwpythonner

Fair point if you're looking at it through a strict compiler-theory lens, but just to clarify—when I say "runs Python directly," I mean there is no virtual machine or interpreter loop involved. The processor executes logic derived from Python ByteCode instructions.

What gets executed is a direct mapping of Python semantics to hardware. In that sense, this is more “direct” than most systems running Python.

This phrasing is about conveying the architectural distinction: Python logic executed natively in hardware, not interpreted in software.

franzb

Wouldn't an AoT Python-to-x86 compiler lead to a similar situation where the x86 processor would "run Python directly"?

_kidlike

After a quick search I found that even Raspberry Pi makes the same claim...

"runs directly on embedded hardware"

https://www.raspberrypi.com/documentation/microcontrollers/m...

I don't understand why they have the need to do this...

rcxdude

Micropython does run directly on the hardware, though. It's a bare-metal binary, no OS. Which is a different claim to running the python code you give it 'directly'.

f1shy

Well, running Python on Raspbian, you could toggle a pin at a couple of kHz at most, nowhere near the ~2 MHz you can do with this project. Also, it claims predictability, so I assume the timing jitter is much lower, which is a very important parameter for real-time applications.

hwpythonner

PyXL is a bit more direct :)

dividuum

Huh? MicroPython literally does exactly that: You copy over Python source(!) code and it runs on the Pico.

wormius

Yeah, that was my first reaction. Wait a minute, you run a compiler on it? It's literally compiled code, not direct. Which is fine, but yeah, it's overselling what it is/does.

Still cool, but I would definitely ease back the first claim.

I was going to say it does make me wonder how much of a pain a direct processor like this would be, in terms of having to constantly update it to adapt to new syntax/semantics every time there's a new release.

Also - are there any processors made to mimic ASTs directly? I figure a Lisp machine does something like that, but not quite... Though I've never even thought to look at how that worked on the hardware side.

EDIT: I'm not sure AST is the correct concept, exactly, but something akin to that... Like building a physical structure of the tree and process it like an interpreter would. I think something like that would require like a real-time self-programming FPGA?

hwpythonner

PyXL deliberately avoids tying itself to Python’s high-level syntax or rapid surface changes.

The system compiles Python source to CPython ByteCode, and then from ByteCode to a hardware-friendly instruction set. Since it builds on ByteCode—not raw syntax—it’s largely insulated from most language-level changes. The ByteCode spec evolves slowly, and updates typically mean handling a few new opcodes in the compiler, not reworking the hardware.
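(For reference, that intermediate layer can be inspected with plain CPython tooling; this is illustrative only, not the PyXL compiler itself:)

    # The stack-based CPython bytecode a toolchain like this starts from.
    import dis

    def blink(pin, n):
        for _ in range(n):
            pin.toggle()   # `pin` is a stand-in object, purely for illustration

    dis.dis(blink)
    # Emits opcodes such as LOAD_FAST, FOR_ITER and CALL; exact names vary by
    # CPython version, which is why the compiler has to track opcode changes.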

Long-term, the hardware ISA is designed to remain fixed, with most future updates handled entirely in the toolchain. That separation ensures PyXL can evolve with Python without needing silicon changes.

BiteCode_dev

Which is what Nuitka does. But the result doesn't allow for real-time Python programs, and you don't get direct access to the hardware like here.

rytill

The phrasing “<statement> — no X, Y, Z, just <final simplified claim>” is cropping up a lot lately.

4o also ends many of its messages that way. It has to be related.

Y_Y

Are there any limitations on what code can run? (discounting e.g. memory limitations and OS interaction)

I'd love to read about the design process. I think the idea of taking bytecode aimed at the runtime of dynamic languages like Python or Ruby or even Lisp or Java and making custom processors for that is awesome and (recently) under-explored.

I'd be very interested to know why you chose to pursue this, why it was a good idea, and how you went about the implementation (in broad strokes if necessary).

hwpythonner

Thanks — really appreciate the interest!

There are definitely some limitations beyond just memory or OS interaction. Right now, PyXL supports a subset of real Python. Many features from CPython are not implemented yet — this early version is mainly to show that it's possible to run Python efficiently in hardware. I'd prefer to move forward based on clear use cases, rather than trying to reimplement everything blindly.

Also, some features (like heavy runtime reflection, dynamic loading, etc.) would probably never be supported, at least not in the traditional way, because the focus is on embedded and real-time applications.

As for the design process — I’d love to share more! I'm a bit overwhelmed at the moment preparing for PyCon, but I plan to post a more detailed blog post about the design and philosophy on my website after the conference.

mikepurvis

In terms of a feature-set to target, would it make sense to be going after RPython instead of "real" Python? Doing that would let you leverage all the work that PyPy has done on separating what are the essential primitives required to make a Python vs what are the sugar and abstractions that make it familiar:

https://doc.pypy.org/en/latest/faq.html#what-is-pypy

ammar2

> I'd prefer to move forward based on clear use cases

Taking the concrete example of the `struct` module as a use-case, I'm curious if you have a plan for it and similar modules. The tricky part of course is that it is implemented in C.

Would you have to rewrite those stdlib modules in pure python?

mikepurvis

As in my sibling comment, pypy has already done all this work.

CPython's struct module is just a shim importing the C implementations: https://github.com/python/cpython/blob/main/Lib/struct.py
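Roughly all of that file amounts to (paraphrased):

    # Lib/struct.py, paraphrased: a thin re-export of the C extension _struct,
    # so there is no pure-Python implementation to fall back on.
    from _struct import *
    from _struct import _clearcache
    from _struct import __doc__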

Pypy's is a Python(-ish) implementation, leveraging primitives from its own rlib and pypy.interpreter spaces: https://github.com/pypy/pypy/blob/main/pypy/module/struct/in...

The Python stdlib has enormous surface area, and of course it's also a moving target.

bokchoi

There were a few chips that supported directly executing JVM bytecodes. I'm not sure why it didn't take off, but I think it is generally more performant to JIT compile hotspots to native code.

https://en.wikipedia.org/wiki/Java_processor

teruakohatu

It did take off, just in a different direction:

https://en.m.wikipedia.org/wiki/Java_Card

To the point where most adult humans in the world probably own a Java-supported processor on a SIM card. Or at least an emulator (for eSIMs).

One example of a CPU arch used on JavaCard devices is the ARM926EJ-S, which I believe can execute Java bytecode.


tsukikage

Running bytecode directly on hardware has certainly been tried (e.g. ARM's Jazelle).

In today's world this is generally not great.

Interpreted languages often include bytecode instructions that actually do very complex things and so do not map nicely to operations that can be sanely implemented in hardware. So you end up with all the usual boring ALU, branch, etc. operations implemented in hardware, and anything else traps and runs a software handler.

Separately, interpreted language bytecode is often a poor fit for hardware execution; e.g. for dotnet (and python) bytecode many otherwise trivial operations do not explicitly encode information about types, and therefore the hardware must track type information in order to do the right thing (floating point addition looks very very different from integer addition!)
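A small illustration of that missing type information, using stock CPython:

    # The bytecode for `a + b` is identical whether the operands end up being
    # ints, floats, or strings; the "which add?" decision is deferred to runtime.
    import dis

    def add(a, b):
        return a + b

    dis.dis(add)
    # On CPython 3.11+ the addition is a single generic BINARY_OP instruction
    # (BINARY_ADD on older versions), with no operand types encoded.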

A lot of effort has been spent on compiler optimisation for x86 and ARM code. JIT compilers benefit massively from this. Meanwhile, interpreted language bytecode is often very lightly optimised, where it is optimised at all (until relatively recently, explicit Python policy as set by Guido van Rossum was to never optimise!).

Optimisation has the side effect of throwing away potentially valuable high-level / semantic information; optimising at the bytecode level hinders debuggability for interpreted code (which is a primary goal in Python) and can also be detrimental to JIT output; and the results are underwhelming compared to JIT, since your small team of plucky bytecode optimisers isn't really going to compete with decades of x86 compiler development; and so the incentive is to not do much of that.

So if you're running bytecode in hardware, on top of all the obvious costs, you are /running unoptimised code/. This is actually the thing that kills these projects - everything else can ultimately be solved by throwing more silicon at it, but this can only really be solved by JITting, and the existing JIT+x86 / JIT+ARM solution is cheap and battle tested.

f1shy

I understand that is the reason Lisp machines were dropped (even at a time when Lisp was still a well-regarded language). At least that's my understanding from the SICP videos; by 1986 it was already clear it was much better to compile to assembly.

hermitShell

JVM I think I can understand, but do you happen to know more about LISP machines and whether they use an ISA specifically optimized for the language, or if the compilers for x86 end up just doing the same thing?

In general I think the practical result is that x86 is like democracy. It’s not always efficient but there are other factors that make it the best choice.

kragen

They used an ISA specifically optimized for the language. At the time it was not known how to make compilers for Lisp that did an adequate job on normal hardware.

The vast majority of computers in the world are not x86.

f1shy

Wait. It was pretty well known how to make compilers for Lisp, and they were not bad. There were some small parts of some Lisps (the numeric tower, overflow to bignum, rationals) which were problematic (and still are today, if you do not have custom HW). But those pieces were and are not that important for general-purpose use. The era of Lisp ISAs was not so long after all.

f1shy

When the RISC processors were available (for the same reason RISC started to grow) it was better to just compile to ASM.

hwpythonner

I built a hardware processor that runs Python programs directly, without a traditional VM or interpreter. Early benchmark: GPIO round-trip in 480ns — 30x faster than MicroPython on a Pyboard (at a lower clock). Demo: https://runpyxl.com/gpio

jonjacky

A much earlier (2012) attempt at a Python bytecode interpreter on an FPGA:

https://pycpu.wordpress.com/

"Running a very small subset of python on an FPGA is possible with pyCPU. The Python Hardware Processsor (pyCPU) is a implementation of a Hardware CPU in Myhdl. The CPU can directly execute something very similar to python bytecode (but only a very restricted instruction set). The Programcode for the CPU can therefore be written directly in python (very restricted parts of python) ..."

boutell

This is very, very cool. Impressive work.

I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.

The background garbage collection thing is easier said than done, but I'm talking to someone who has already done something impressively difficult, so...

rangerelf

> I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.

It almost sounds like you're asking for Nim ( https://nim-lang.org/ ); and there are some projects using it for microcontroller programming, since it compiles down to C (for ESP32, last I saw).

obitsten

Why is it not routine to "compile" Python? I understand that the interpreter is great for rapid iteration, cross compatibility, etc. But why is it accepted practice in the Python world to eschew all of the benefits of compilation by just dumping the "source" file in production?

cchianel

The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library). Without type annotations, there is very little for a non-JIT compiler to optimize, since:

- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.

- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).

- There are multiple ways every single operation in Python might be called; for instance `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which method is used unless you know the type (and if that type is monkeypatched, then the method that gets called might change; see the sketch below).

A JIT compiler might be able to optimize some of these cases (observing what is the actual type used), but a JIT compiler can use the source file/be included in the CPython interpreter.
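The monkey-patching point above, in code (plain CPython, purely illustrative):

    # The target of `s.read()` can change at runtime, so a compiler cannot
    # bake a fixed call target into generated code.
    class Sensor:
        def read(self):
            return 42

    def fake_read(self):
        return -1

    s = Sensor()
    print(s.read())          # 42, resolved through Sensor.read
    Sensor.read = fake_read  # monkey patch: rebind the method on the class
    print(s.read())          # -1, same call site, different code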

hwpythonner

You make a great point — type information is definitely a huge part of the challenge.

I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.

So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.
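A minimal example of that late binding (illustrative):

    # The name `handler` is looked up at call time, not when `caller` is defined,
    # so rebinding it changes what already-compiled code does.
    def handler():
        return "original"

    def caller():
        return handler()

    print(caller())   # original

    def handler():    # rebinds the module-level name
        return "patched"

    print(caller())   # patched: same caller bytecode, new target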

That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.

pjmlp

Solved by Smalltalk, Self, and Lisp JITs, which are at the genesis of JIT technology; some of that work landed in HotSpot and V8.

dragonwriter

Python starting with 3.13 also has a JIT available.

jonathaneunice

"Addressed" or "mitigated" perhaps. Not "solved." Just "made less painful" or "enough less painful that we don't need to run screaming from the room."

Qem

> The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library).

When type annotations are available, it's already possible to compile Python to improve performance, using Mypyc. See for example https://blog.glyph.im/2022/04/you-should-compile-your-python...
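For instance, a fully annotated function like this is the sort of thing mypyc can compile into a C extension module (a minimal sketch; the file name fib.py and the `mypyc fib.py` invocation are the standard usage pattern, not taken from the linked post):

    # fib.py: with precise annotations, mypyc can generate a C extension
    # (built with `mypyc fib.py`) that runs much faster than the interpreted version.
    def fib(n: int) -> int:
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a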

Someone

Python doesn’t eschew all benefits of compilation. It is compiled, but to an intermediate byte code, not to native code, (somewhat) similar to the way java and C# compile to byte code.

Those, at runtime (and, nowadays, optionally also at compile time), convert that to native code. Python doesn’t; it runs a bytecode interpreter.

Reason Python doesn’t do that is a mix of lack of engineering resources, desire to keep the implementation fairly simple, and the requirement of backwards compatibility of C code calling into Python to manipulate Python objects.

jerf

If you define "compiling Python" as basically "taking what the interpreter would do but hard-coding the resulting CPU instructions executed instead of interpreting them", the answer is, you don't get very much performance improvement. Python's slowness is not in the interpreter loop. It's in all the things it is doing per Python opcode, most of which are already compiled C code.

If you define it as trying to compile Python in such a way that you would get the ability to do optimizations and get performance boosts and such, you end up at PyPy. However that comes with its own set of tradeoffs to get that performance. It can be a good set of tradeoffs for a lot of projects but it isn't "free" speedup.

jonathaneunice

A giant part of the cost of dynamic languages is memory access. It's not possible, in general, to know the type, size, layout, and semantics of values ahead of time. You also can't put "Python objects" or their components in registers like you can with C, C++, Rust, or Julia "objects." Gradual typing helps, and systems like Cython, RPython, PyPy etc. are able to narrow down and specialize segments of code for low-level optimization.

But the highly flexible and dynamic nature of Python means that a lot of the work has to be done at runtime, reading from `dict` and similar dynamic in-memory structures. So you have large segments of code that are accessing RAM (often not even from caches, but genuine main memory, and often many times per operation). The associated IO-to-memory delays are HUGE compared to register access and computation more common to lower-level languages. That's irreducible if you want Python semantics (i.e. its flexibility and generality).

Optimized libraries (e.g. numpy, Pandas, Polars, lxml, ...) are the idiomatic way to speed up "the parts that don't need to be in pure Python." Python subsets and specializations (e.g. PyPy, Cython, Numba) fill in some more gaps. They often use much tighter, stricter memory packing to get their speedups.

For the most part, with the help of those lower-level accelerations, Python's fast enough. Those who don't find those optimizations enough tend to migrate to other languages/abstractions like Rust and Julia because you can't do full Python without the (high and constant) cost of memory access.

ModernMech

Part of the issue is the number of instructions Python has to go through to do useful work. Most of that is unwrapping values and making sure they're the right type to do the thing you want.

For example if you compile x + y in C, you'll get a few clean instructions that add the data types of x and y. But if you compile this thing in some sort of Python compiler it would essentially have to include the entire Python interpreter; because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that is executed to unwrap values, determine which "add" to call, and so forth.
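A rough sketch, in Python itself, of the dispatch that a generic `x + y` implies when the types are unknown (simplified from CPython's actual rules):

    # What a compiler must emit (or call into) for `x + y` with unknown types:
    # try the left operand's __add__, then fall back to the right operand's __radd__.
    def binary_add(x, y):
        add = getattr(type(x), "__add__", None)
        result = add(x, y) if add is not None else NotImplemented
        if result is NotImplemented:
            radd = getattr(type(y), "__radd__", None)
            result = radd(y, x) if radd is not None else NotImplemented
        if result is NotImplemented:
            raise TypeError(f"unsupported operand types: {type(x).__name__} + {type(y).__name__}")
        return result

    print(binary_add(1, 2.5))  # 3.5: int.__add__ returns NotImplemented, float.__radd__ handles it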

If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile, and unannotated code that must remain interpreted at runtime that'll kill your overall performance anyway.

That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."

kragen

You don't need the whole Python interpreter to fall back to dynamic method dispatch for overloaded operators. CPython itself implements them with per-interface vtables for C extensions, very similar to Golang but laboriously constructed by hand.

For most code, you don't need static typing for most overloaded operators to get decent performance, either. From my experience with Ur-Scheme, even a simple prediction that most arithmetic is on (small) integers with a runtime typecheck and conditional jump before inlining the integer version of each arithmetic operation performs remarkably well—not competitive with C but several times faster than CPython. It costs you an extra conditional branch in the case where the type is something else, but you need that check anyway if you are going to have unboxed integers, and it's smallish compared to the call and return you'll need once you find the correct overload to call. (I didn't implement overloading in Ur-Scheme, just exiting with an error message.)
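A Python-level sketch of that strategy (in the real technique the fast path is emitted as inline machine code, and `generic_add` is just a stand-in for the full overload resolution):

    # Predict small-integer arithmetic, verify cheaply, and fall back otherwise.
    def generic_add(a, b):
        return a + b               # stand-in for the full overload-resolution slow path

    def add_specialized(a, b):
        if type(a) is int and type(b) is int:
            return a + b           # common case: the inlined integer add
        return generic_add(a, b)   # everything else takes the slow path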

Even concatenating strings is slow enough that checking the tag bits to see if you are adding integers won't make it much slower.

Where this approach really falls down is choosing between integer and floating point math. (Also, you really don't want to box your floats.)

And of course inline caches and PICs are well-known techniques for handling this kind of thing efficiently. They originated in JIT compilers, but you can use them in AOT compilers too; Ian Piumarta showed that.

franga2000

There's no benefit that I know of, besides maybe a tiny cold start boost (since the interpreter doesn't need to generate the bytecode first).

I have seen people do that for closed-source software that is distributed to end-users, because it makes reverse engineering and modding (a bit) more complicated.

hwpythonner

There have been efforts (like Cython, Nuitka, PyPy's JIT) to accelerate Python by compiling subsets or tracing execution, but none fully replace the standard dynamic model, at least as far as I know.

rthomas6

* What HDL did you use to design the processor?

* Could you share the assembly language of the processor?

* What is the benefit of designing the processor and making a Python bytecode compiler for it, vs making a bytecode compiler for an existing processor such as ARM/x86/RISCV?

hwpythonner

Thanks for the question.

HDL: Verilog

Assembly: The processor executes a custom instruction set called PySM (Not very original name, I know :) ). It's inspired by CPython Bytecode — stack-based, dynamically typed — but streamlined to allow efficient hardware pipelining. Right now, I’m not sharing the full ISA publicly yet, but happy to describe the general structure: it includes instructions for stack manipulation, binary operations, comparisons, branching, function calling, and memory access.

Why not ARM/X86/etc... Existing CPUs are optimized for static, register-based compiled languages like C/C++. Python’s dynamic nature — stack-based execution, runtime type handling, dynamic dispatch — maps very poorly onto conventional CPUs, resulting in a lot of wasted work (interpreter overhead, dynamic typing penalties, reference counting, poor cache locality, etc.).

pak9rabid

Wow, this is fascinating stuff. Just a side question (and please understand I am not a low-level hardware expert, so pardon me if this is a stupid question): does this arch support any sort of speculative execution, and if so do you have any sort of concerns and/or protections in place against the sort of vulnerabilities that seem to come inherent with that?

hwpythonner

Thanks — and no worries, that’s a great question!

Right now, PyXL runs fully in-order with no speculative execution. This is intentional for a couple of reasons: First, determinism is really important for real-time and embedded systems — avoiding speculative behavior makes timing predictable and eliminates a whole class of side-channel vulnerabilities. Second, PyXL is still at an early stage — the focus right now is on building a clean, efficient architecture that makes sense structurally, without adding complex optimizations like speculation just for the sake of performance.

In the future, if there's a clear real-world need, limited forms of prediction could be considered — but always very carefully to avoid breaking predictability or simplicity.

ammar2

> it includes instructions for stack manipulation, binary operations

Your example contains some integer arithmetic, I'm curious if you've implemented any other Python data types like floats/strings/tuples yet. If you have, how does your ISA handle binary operations for two different types like `1 + 1.0`, is there some sort of dispatch table based on the types on the stack?

kragen

Python the language isn't stack-based, though CPython's bytecode is. You could implement it just as well on top of a register-based instruction set. You may have a point about the other features that make it hard to compile, though.

larusso

This sounds like your 'arch' (sorry, don't 100% know the correct term here) could potentially also run Ruby/JS, if the toolchain can translate it into your assembly language?

hwpythonner

Good question — I’m not 100% sure. I'm not an expert on Ruby or JS internals, and I haven’t studied their execution models deeply. But in theory, if the language is stack-based (or can be mapped cleanly onto a stack machine), and if the ISA is broad enough to cover their needs, it could be possible. Right now, PyXL’s ISA is tuned around Python’s patterns — but generalizing it for other languages would definitely be an interesting challenge.

tlb

How do you deal with instructions that iterate through variable amounts of memory, like concatenating strings? Are such instructions interruptible?

Perhaps they don't need to be interruptible if there's no virtual memory.

How does it allocate memory? Malloc and free are pretty complex to do in hardware.

rkagerer

Back when C# came out, I thought for sure someone would make a processor that would natively execute .Net bytecode. Glad to see it finally happened for some language.

kcb

For Java, this was around for a bit: https://en.wikipedia.org/wiki/Jazelle

monocasa

Even better was a complete system, rather than a mode for ARM processors that ran a subset of the common JVM opcodes.

https://en.wikipedia.org/wiki/PicoJava

varispeed

Didn't some phones have hardware Java execution or does my memory fail me?

Sesse__

It's called Jazelle.

f1shy

Yes. I know of at least one Motorola phone which had a co-processor for Java (not Jazelle).

jiehong

Java got that with smart cards, for example. Cute oddities of the past.

monocasa

JavaCard was implemented as just a regular interpreter, last time I checked.

supportengineer

Does anyone remember the JavaOne ring giveaway?

https://news.ycombinator.com/item?id=8598037

zahlman

In university, for my undergrad thesis, I wanted to do this for a Befunge variant (choosing the character set to simplify instruction decoding). My supervisor insisted on something more practical, though. :(

zahlman

I probably should have added a link: https://esolangs.org/wiki/Befunge

The main thing that appealed to me about this idea is that it would require a two-dimensional program counter. As I recall from the original specification, skipping through blank space is supposed to take O(1) time, but I didn't plan on implementing that. I did, however, imagine a machine with 256x256 bytes of memory, where some 80x25 (or 24?) region was reserved as directly memory-mapped to a character display (and protected at boot by surrounding it with jump instructions).

ComputerGuru

I want to say there was a product that did this circa 2006-2008 but all I’m finding is the .NET Micro Framework and its modern successor the .NET nano Framework.

I’ve been using .NET since 2001 so maybe I have it confused with something else, but at the same time a lot of the web from that era is just gone, so it’s possible something like this did exist but didn’t gain any traction and is now lost to the ether.

duskwuff

There was Netduino, but that was an STM32 microcontroller running an interpreter, not dedicated hardware which directly executed CLR code.

rcorrear

Maybe you’re thinking of Singularity OS?

john-h-k

The tl;dr (I spent lots of time investigating this) is that it just fundamentally isn’t a good bytecode for execution. It’s designed to be small on disk, not hardware friendly

whoomp12342

I'd be surprised if azure app services didn't do this already.

john-h-k

I’d be willing to bet my net worth that they don’t

whoomp12342

then why does azure app services have you pick the .net version?!

actionfromafar

Wouldn't that be a real scoop?

bongodongobob

Azure runs on Linux if I'm not mistaken.

ggiesen

Nope.

sunray2

Very interesting!

What are the fundamental physical limits here? Namely, timing precision, latency, and jitter? How fast could PyXL bytecode react to an input?

For info, there is ARTIQ: vaguely similar thing that effectively executes Python code with 'embedded level' performance:

https://m-labs.hk/experiment-control/artiq/

ARTIQ is quite common in quantum physics labs. For that you need very precise and deterministic timing. Imagine you're interfering two photons as they reach a piece of glass, so that they can interact. It doesn't get faster than photons! That typically means nanosecond timing, sub-microsecond latency.

How ARTIQ does it is also interesting. The Python code is separate from the FPGA which actually executes the logic you want to do. In a hand-wavy way, you're then 'as fast' as the FPGA. How, though? The catch is, you have to get the Python code and FPGA gateware talking to each other, and that's technically difficult and has many gotchas. In comparison, although PyXL isn't as performant, if it makes it simpler for the user, that's a huge win for everyone.

Congrats once again!

sunray2

(minor edit: for observing experimental signatures of photon interference, nanosecond precision is the minimum to see anything when synchronising your experimental bits and pieces, but to see a useful signal you need precision at the tens-of-picoseconds level! So, beyond what's immediately possible here.)

brcmthrowaway

Did you work at Rigetti?

sunray2

No, didn't work there.

I looked up any connection to ARTIQ they may have: it seems they do full stack QC, as they have their own quantum compiler [1]. But I'm not really sure what they're doing currently.

[1] https://github.com/quil-lang/quilc

froh

Do I get this right? this is an ASIC running a python-specific microcontroller which has python-tailored microcode? and together with that a python bytecode -> microcode compiler plus support infrastructure to get the compiled bytecode to the ASIC?

fun :-)

but did I get it right?

hwpythonner

You're close: It's currently running on an FPGA (Zynq-7000) — not ASIC yet — but yeah, could be transferable to ASIC (not cheap though :))

It's a custom stack-based hardware processor tailored for executing Python programs directly. Instead of traditional microcode, it uses a Python-specific instruction set (PySM) that the hardware executes.

The toolchain compiles Python → CPython Bytecode → PySM Assembly → hardware binary.

cchianel

As someone who did a CPython Bytecode → Java bytecode translator (https://timefold.ai/blog/java-vs-python-speed), I strongly recommend against the CPython Bytecode → PySM Assembly step:

- CPython Bytecode is far from stable; it changes every version, sometimes changing the behaviour of existing bytecodes. As a result, you are pinned to a specific version of Python unless you make multiple translators.

- CPython Bytecode is poorly documented, with some descriptions being misleading/incorrect.

- CPython Bytecode requires restoring the stack on exception, since it keeps a loop iterator on the stack instead of in a local variable.

I recommend instead doing CPython AST → PySM Assembly. CPython AST is significantly more stable.
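For comparison, the AST layer being recommended here (stock CPython, illustrative):

    # The AST is a higher-level, more stable input than version-specific bytecode.
    import ast

    tree = ast.parse("pin.toggle()")
    print(ast.dump(tree, indent=2))
    # Prints a Module/Expr/Call/Attribute/Name tree; these node classes change
    # far less often across CPython releases than the bytecode opcodes do.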

hwpythonner

Thanks — really appreciate your insights.

You're absolutely right that CPython bytecode changes over time and isn’t perfectly documented — I’ve also had to read the CPython source directly at times because of unclear docs.

That said, I intentionally chose to target bytecode instead of AST at this stage. Adhering to the AST would actually make me more vulnerable to changes in the Python language itself (new syntax, new constructs), whereas bytecode changes are usually contained to VM-level behavior. It also made it much easier early on, because the PyXL compiler behaves more like a simple transpiler — taking known bytecode and mapping it directly to PySM instructions — which made validation and iteration faster.

Either way, some adaptation will always be needed when Python evolves — but my goal is to eventually get to a point where only the compiler (the software part of PyXL) needs updates, while keeping the hardware stable.

nurettin

This was my first thought as well. They will be stuck at a certain python version

bangaladore

Have you considered joining the next Tiny Tapeout run? This is exactly the type of project I'm sure they would sponsor or try to get to ASIC.

In case you weren't aware, they give you a 200 x 150 um tile on a shared chip. There is then some helper logic to mux between the various projects on the chip.

https://tinytapeout.com/

froh

fascinating :-) how do you do GC/memory management?

relistan

Not an ASIC, it’s running on an FPGA. There is an ARM CPU that bootstraps the FPGA. The rest of what you said is about right.

thenobsta

Amazing work! This is a great project!

Every time I see a project that has a great implementation on an FPGA, I lament the fact that Tabula didn’t make it, a truly innovative and fast FPGA.

<https://en.m.wikipedia.org/wiki/Tabula,_Inc.>

asford

The benchmark results presented in this page are extremely misleading; you're not comparing to the actual baseline gpio performance available in micropython.

MicroPython already exposes "viper", which transpiles bytecode to machine instructions for highly timing- or performance-critical code paths. This is reasonably well explained in the MicroPython docs, which have an example explaining how to trigger a GPIO very rapidly.

https://docs.micropython.org/en/latest/reference/speed_pytho...

Viper runs on device and directly emits native machine code for decorated micropython functions. If you have serious timing requirements for gpio, then this is how you do it.
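For reference, the pattern from those docs looks roughly like this (a sketch; the register address and pin mask are board-specific placeholders, not working values):

    import micropython

    @micropython.viper
    def toggle_n(bsrr_addr: int, pin_bit: int, n: int):
        # bsrr_addr: GPIO bit set/reset register address for the target port
        # pin_bit:   mask for the pin being driven (both placeholders, board-specific)
        bsrr = ptr32(bsrr_addr)
        for _ in range(n):
            bsrr[0] = pin_bit        # set the pin
            bsrr[0] = pin_bit << 16  # reset the pin (upper half of BSRR)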

Of course, this is a restricted subset of the language compatible with direct native code gen, notably just supporting integer datatypes. However, I would be shocked if this project wasn't also restricted to a subset of the language functionality for its transpilation pipeline.

The benchmark should be rewritten to compare against a baseline in MicroPython using viper. Though this project is pretty neat, the over-inflated performance claims would rapidly deflate against a strong baseline.