Loading Pydantic models from JSON without running out of memory
21 comments · May 22, 2025
m_ke
mbb70
A great feature of Pydantic is the validation hooks that let you intercept serialization/deserialization of specific fields and augment behavior.
For example, if you are querying a DB that returns a column as a JSON string, it is trivial with Pydantic to JSON-parse the column as part of deserialization with an annotation.
Pydantic is definitely slower and not a 'zero cost abstraction', but you do get a lot for it.
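A minimal sketch of the JSON-string-column case, assuming Pydantic v2. The `Row` model and its field names are made up for illustration; `Json[...]` is Pydantic's annotation for a field that arrives as a JSON string and should be parsed and then validated.

```python
from pydantic import BaseModel, Json


class Row(BaseModel):
    # Hypothetical DB row: "payload" comes back from the driver as a JSON string.
    id: int
    payload: Json[dict[str, int]]


# The string is parsed and validated against dict[str, int] during loading.
row = Row.model_validate({"id": 1, "payload": '{"clicks": 3, "views": 10}'})
print(row.payload["clicks"])
```

For anything more custom than plain JSON parsing, a `BeforeValidator` attached via `Annotated` plays the same role.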
itamarst
msgspec is much more memory efficient out of the box, yes. Also quite fast.
jmugan
My problem isn't running out of memory; it's loading in a complex model where the fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load it all the way and leaves some of the deeper parts as dictionaries. I need like almost a parser to search the space of different loads. Anyone have any ideas for software that does that?
not_skynet
going to shamelessly plug my own library here: https://github.com/mivanit/ZANJ
You can have nested dataclasses, as well as specify custom serializers/loaders for things which aren't natively supported by json.
enragedcacti
The only reason I can think of for the behavior you are describing is if one of the unioned types at some level of the hierarchy is equivalent to Dict[str, Any]. My understanding is that Pydantic will explore every option provided recursively and raise a ValidationError if none match but will never just give up and hand you a partially validated object.
Are you able to share a snippet that reproduces what you're seeing?
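To illustrate the hypothesis above, here is a small sketch, assuming Pydantic v2. All model names are hypothetical. When a union includes a `Dict[str, Any]` arm, a payload that fails the model arm is still accepted as a plain dictionary; without that arm the same payload raises.

```python
from typing import Any, Dict, Union

from pydantic import BaseModel, ValidationError


class Inner(BaseModel):
    x: int


class LooseOuter(BaseModel):
    # The dict arm accepts any JSON object, including ones Inner rejects.
    inner: Union[Inner, Dict[str, Any]]


class ExactOuter(BaseModel):
    inner: Inner


bad = {"inner": {"x": "not a number"}}

# Inner fails to validate, so the dict arm wins and no error is raised.
loose = LooseOuter.model_validate(bad)
print(type(loose.inner))

# With no dict arm to fall back on, the same input is rejected.
try:
    ExactOuter.model_validate(bad)
    exact_raised = False
except ValidationError:
    exact_raised = True
```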
causasui
You probably want to use Discriminated Unions https://docs.pydantic.dev/latest/concepts/unions/#discrimina...
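A minimal sketch of a discriminated union, assuming Pydantic v2. The `Cat`/`Dog`/`Owner` models and the `kind` tag field are invented for illustration. The tag tells Pydantic which arm to validate against, so it does not have to try each arm in turn.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class Cat(BaseModel):
    kind: Literal["cat"]
    lives: int


class Dog(BaseModel):
    kind: Literal["dog"]
    good_boy: bool


class Owner(BaseModel):
    # "kind" selects the arm; a mismatch is reported against that arm only.
    pet: Union[Cat, Dog] = Field(discriminator="kind")


owner = Owner.model_validate_json('{"pet": {"kind": "dog", "good_boy": true}}')
print(type(owner.pet).__name__)
```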
cbcoutinho
At some point, we have to admit we're asking too much from our tools.
I know nothing about your context, but in what context would a single model need to support so many permutations of a data structure? Just because software can, doesn't mean it should.
shakna
Anything multi-tenant? There's a reason Salesforce is used by so many large organisations. The multi-nesting lets you account for all the discrepancies that come with scale.
Just tracking payments through multiple tax regions will explode the places where things need to be tweaked.
fjasdfas
So are there downsides to just always setting slots=True on all of my python data types?
itamarst
You can't add extra attributes that weren't part of the original dataclass definition:
>>> from dataclasses import dataclass
>>> @dataclass
... class C: pass
...
>>> C().x = 1
>>> @dataclass(slots=True)
... class D: pass
...
>>> D().x = 1
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    D().x = 1
    ^^^^^
AttributeError: 'D' object has no attribute 'x' and no __dict__ for setting new attributes
Most of the time this is not a thing you actually need to do.
monomial
I rarely need to dynamically add attributes myself on dataclasses like this but unfortunately this also means things like `@cached_property` won't work because it can't internally cache the method result anywhere.
masklinn
Also some of the introspection stops working e.g. vars().
If you're using dataclasses it's less of an issue, because dataclasses.asdict still works.
zxilly
Maybe using mmap would also save some memory, though I'm not sure whether this can be implemented in Python.
itamarst
Once you switch to ijson it will not save any memory, no, because ijson essentially uses zero memory for the parsing. You're just left with the in-memory representation.
dgan
I gave up on Python dataclasses and JSON. I use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.
Automatic, statically typed deserialization is worth the trouble, in my opinion.
thisguy47
I'd like to see a comparison of ijson vs just `json.load(f)`. `ujson` would also be interesting to see.
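One rough way to run such a comparison yourself, sketched with the standard library's `tracemalloc`. The `peak_bytes` helper and the synthetic data are assumptions for illustration. Only the `json.load` baseline is measured here; an ijson or ujson loader could be passed in the same way. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory allocated directly by C extensions may be undercounted.

```python
import io
import json
import tracemalloc


def peak_bytes(load, raw: bytes) -> int:
    """Return the peak traced allocation while load(file_obj) runs."""
    tracemalloc.start()
    try:
        result = load(io.BytesIO(raw))  # keep the result alive while measuring
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Synthetic input: a JSON array of small records.
raw = json.dumps([{"id": i, "name": f"user{i}"} for i in range(10_000)]).encode()

json_load_peak = peak_bytes(json.load, raw)
print(f"input: {len(raw) / 1e6:.2f} MB, json.load peak: {json_load_peak / 1e6:.2f} MB")
```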
itamarst
For my PyCon 2025 talk I did this. Video isn't up yet, but slides are here: https://pythonspeed.com/pycon2025/slides/
The linked-from-original-article ijson article was the inspiration for the talk: https://pythonspeed.com/articles/json-memory-streaming/
Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/