A deep dive into Linux's new mseal syscall

ykonstant

Interesting. The article mentions "spicy discussions" in the kernel mailing list. Is there any insider who can summarize objections and concerns? I tend to avoid reading the mailing list itself since it can get too spicy, and my headaches are already strong enough!

The mechanism itself seems reasonable, but I am surprised that something like this doesn't already exist in the kernel.

ziddoap

Not sure if there was much more to it than the thread linked to, but it was basically Linus being Linus. He said stuff that made sense in a pretty blunt fashion.

There were flags proposed that allowed the seal to be ignored.

>So you say "we can't munmap in this *one* place, but all others ignore the sealing".

Later was the spice.

>And dammit, once something is sealed, it is SEALED. None of this crazy "one place honors the sealing, random other places do not".

And later, even spicier, Linus says that seals cannot be ignored and that is non-negotiable. Any further suggestions to ignore a seal via a flag would result in the person being added to Linus' ignore list. (He, of course, said this with some profanities and capitals sprinkled in.)

js2

Wasn't just Linus. Earlier, from Theo de Raadt:

> I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above it shocks me.

https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...

Via https://lwn.net/Articles/948129/

Affric

Thank you.

That was beautiful.

Demonstrated the difference in design/engineering philosophies from two of the greats.

benreesman

You’re asking me how a watch works, let’s just try to keep an eye on the time.

https://youtu.be/vkYqs9iuJqY?feature=shared&t=109

0xbadcafebee

Not a great perspective... "It took me a year [or more] to understand this. The fact that you don't understand it shocks me." Dude, not everybody's as smart or experienced as you. Here's an opportunity to be a mentor.

santiagobasulto

Is that considered "spicy"? Is the sensitivity threshold maybe too low?

f1shy

Extremely too low. My personal opinion: people who cannot take that kind of criticism have no place in such a project. Period.

f1shy

And the reason why SW development sucks in enterprise is the lack of people who can speak clearly like Linus.

greenavocado

https://lwn.net/ml/linux-kernel/7071.1697661373@cvs.openbsd....

    From:   Theo de Raadt <deraadt-AT-openbsd.org>
    To:   Jeff Xu <jeffxu-AT-google.com>

    > On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
    > >
    > > Let's start with the purpose.  The point of mimmutable/mseal/whatever is
    > > to fix the mapping of an address range to its underlying object, be it
    > > a particular file mapping or anonymous memory.  After the call succeeds,
    > > it must not be possible to make any address in that virtual range point
    > > into any other object.
    > >
    > > The secondary purpose is to lock down permissions on that range.
    > > Possibly to fix them where they are, possibly to allow RW->RO transitions.
    > >
    > > With those purposes in mind, you should be able to deduce for any syscall
    > > or any madvise(), ... whether it should be allowed.
    > >
    > I got it.
    > 
    > IMO: The approaches mimmutable() and mseal() took are different, but
    > we all want to seal the memory from attackers and make the linux
    > application safer.

    I think you are building mseal for chrome, and chrome alone.

    I do not think this will work out for the rest of the application space
    because

    1) it is too complicated
    2) experience with mimmutable() says that applications don't do any of it
    themselves, it is all in execve(), libc initialization, and ld.so.
    You don't strike me as an execve, libc, or ld.so developer.

greenavocado

    From:   Matthew Wilcox <willy-AT-infradead.org>
    To:   Jeff Xu <jeffxu-AT-google.com>

    ...

    Yes, thank you for demonstrating that you have no idea what you need to
    block.

    > It is practical to keep syscall extentable, when the business logic is the same.

    I concur with Theo & Linus.  You don't know what you're doing.  I think
    the underlying idea of mimmutable() is good, but how you've split it up
    and how you've implemented it is terrible.

    ...

lathiat

ykonstant

Very nice, thanks!

Edit: I always find it funny that these articles on the mailing list tend to read like a sports announcer describing a boxing match!

MBCook

A question about using this call:

Chrome is the one who wants it. But you can’t unmap sealed pages because an attacker could then re-map them with different flags.

So that basically means this can never be used on pages allocated at runtime unless you intend to hold them for the entire process lifetime, right?

Doesn’t that mean it can’t be used for all the memory used by, say, the JS sandbox which would be a very very tempting target?

Or is the idea that you deal with this by always running that kind of stuff in a different process where you can seal the memory and then you can just kill the process when you’re done?

I’m not familiar with how Chrome manages memory/processes, so I’m not exactly sure why this wouldn’t be an issue.

Is this also the reason why the articles about this often mention it’s not useful to most programs (outside of how memory is set up at process startup)?
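The behaviour asked about above can be probed with a small sketch. It assumes a 64-bit Linux process, and it assumes the syscall number 462 for mseal (the value used on x86_64 and arm64 as far as I know; check your kernel headers). `seal_and_probe` is a hypothetical helper name, not part of any library. It seals one anonymous page and then checks whether a later `mprotect()` is refused.

```python
import ctypes
import errno
import mmap
import sys

SYS_MSEAL = 462  # assumption: mseal's number on x86_64/arm64, Linux 6.10+
PROT_READ = 0x1

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.mprotect.restype = ctypes.c_int


def seal_and_probe():
    """Seal one anonymous page, then try to change its protection.

    Returns "sealed" if the later mprotect() fails with EPERM,
    "unsupported" if mseal itself fails (old kernel, sandbox, non-Linux),
    and "not-enforced" if mprotect() still succeeds after sealing.
    """
    if not sys.platform.startswith("linux"):
        return "unsupported"

    length = mmap.PAGESIZE
    page = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    view = ctypes.c_char.from_buffer(page)  # gives us the mapping's address
    addr = ctypes.addressof(view)

    # mseal(start, len, flags); flags must currently be 0.
    rc = libc.syscall(ctypes.c_long(SYS_MSEAL), ctypes.c_ulong(addr),
                      ctypes.c_size_t(length), ctypes.c_ulong(0))
    if rc != 0:
        return "unsupported"

    failed = libc.mprotect(addr, length, PROT_READ) == -1
    if failed and ctypes.get_errno() == errno.EPERM:
        return "sealed"
    return "not-enforced"
```

On a kernel with mseal support I would expect `seal_and_probe()` to return `"sealed"`. Note that the sealed page can no longer be unmapped either, so it stays mapped until the process exits, which is exactly the lifetime constraint the question is about.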

PhilipRoman

>Doesn’t that mean it can’t be used for all the memory used by, say, the JS sandbox which would be a very very tempting target?

Multiprocessing is an option here. I think chrome uses it extensively, so that might be the play here. You need separate processes for other stuff anyway, like isolation via namespaces.

masklinn

Yes, in fact in his comments Theo de Raadt specifically says (amongst other things):

> experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so.

So this is almost never something a process does to itself; it is part of the sandboxing of child processes.

throw0101a

mseal() and what comes after, October 20, 2023: https://lwn.net/Articles/948129/

mseal() gets closer, January 19, 2024: https://lwn.net/Articles/958438/

Memory sealing for the GNU C Library, June 12, 2024: https://lwn.net/Articles/978010/

sim7c00

i am sad operating systems need to have such calls implemented while most modern (x86_64) architectures have so many features to facilitate safe and sound programming and computing. legacy crap and mentality, and trying to patch old systems built on paradigms not matching the current world and knowledge rather than rebuilding, really put a brake on progress in computing, and put literally billions at risk.

not to say these things aren't steps in the right direction, but if you let go of current ideals on how operating systems work, and take into account current systems, knowledge about them, and knowledge about what people want from systems, you can envision systems free from the burden and risks put on developers and users today.

yes architecture bugs exist, but software hardly takes advantage of current features truly, so arguing about architectural bugs is a moot point. there are cheaper ways to compromise, and always will be if things are built on shaky foundations

GolDDranks

Enlighten me: what unused or underused safety features does x86_64 have that wouldn't require the OS to have some method of using or enabling them? Why do you think mseal isn't warranted, and what would be better instead?

gcr

I read the above poster's critique as being about operating system and API design more generally. x86-64 can do wonderful things with memory access paradigms, so why must we keep using Linux and its baked-in assumptions about how memory should work? Let's instead rewrite everything to be memory-safe, with safety enforced by everything we've learned in the last 60 years of architecture design.

That's what I think the parent post is saying. (I personally gently disagree)

pjc50

Such as what, though?

metadat

Will it be possible to override / disable the `mseal` syscall with the LD_PRELOAD trick?

eska

> mseal digresses from prior memory protection schemes on Linux because it is a syscall tailored specifically for exploit mitigation against remote attackers seeking code execution rather than potentially local ones looking to exfiltrate sensitive secrets in-memory.

If a remote attacker can change the local environment then they must have already broken into your system.

gcr

Not necessarily. By posting this comment, I have caused "THIS STRING IS HARMFUL" to enter your computer's memory! If you see my comment on your screen, it's too late -- as a remote attacker, I have already changed the local environment! I've even slightly changed the rendering of the webpage you're looking at! Muahahah!

The point is that "The local environment" could refer to what's inside the sandbox. Your browser isn't going to treat my comment as x86 machine code and execute it, for example. Javascript is heavily sandboxed, and mseal() and friends are ways to add another layer of sandboxing.

rowanG077

The poster obviously meant environment variables as in the LD_PRELOAD variable mentioned previously...

Dwedit

Probably not LD_PRELOAD. It would need to be an imported function in order for LD_PRELOAD to have any effect. A raw syscall would not be interceptable that way.

Discussion about intercepting linux syscalls: https://stackoverflow.com/questions/69859/how-could-i-interc...

But building your own patched kernel that pretends that mseal works would be the simplest way to "disable" that feature. Programs that use mseal could still do sanity checks to see if mseal actually works or not. Then a compromised kernel would need secret ways to disable mseal after it has been applied, to stop the apps from checking for a non-functional mseal.

jandrese

I'm not sure what protection you could expect on any system where the kernel has been replaced by the attacker. Sure they can bypass mseal, but they are also bypassing all other security on the box.

Dwedit

Two different considerations for when you'd want to deny memory to other processes:

- Protecting against outside attackers

- Digital Rights Management

Faking "mseal" is something you might intentionally do if you are trying to break DRM, and something you would not want to do if you are trying to defend against outside attackers.

monocasa

There's a bunch of ways to override it if you have early control over the process. Another example: ptrace the executable, watch the system calls, and skip over any mseal(2)s.

This system call is meant for a different threat model than "attacker has early access to your process before it started initializing".

chucky_z

You can override the mseal call wrapper but not the syscall itself.

This is an interesting thought, so I looked it up, and this is how (all?) preload syscall overrides work. You override the wrapper but not the syscall itself, so if you’re doing direct syscalls I don’t think that can be overridden. Technically you could override the syscall function itself maybe?

jmmv

> Technically you could override the syscall function itself maybe?

But then you can just write assembly code to issue the system call.
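The wrapper-versus-syscall distinction in this sub-thread can be sketched with a harmless call. This is only an illustration: it uses `getpid` instead of `mseal`, the syscall numbers in the table are assumptions taken from the x86_64 and arm64 syscall tables, and the helper names are hypothetical. Note that Python still goes through libc's `syscall()` function here, which is itself a dynamically resolved symbol; only inline assembly avoids libc entirely.

```python
import ctypes
import os
import platform
import sys

# Assumption: getpid's syscall number per architecture (64-bit Linux only).
SYS_GETPID = {"x86_64": 39, "aarch64": 172}

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long


def pid_via_wrapper():
    """Call the named libc wrapper, a symbol that LD_PRELOAD can shadow."""
    return libc.getpid()


def pid_via_number():
    """Request the same service by number, bypassing the named wrapper.

    Returns None where the number table above does not apply.
    """
    is_linux64 = (sys.platform.startswith("linux")
                  and ctypes.sizeof(ctypes.c_void_p) == 8)
    number = SYS_GETPID.get(platform.machine())
    if not is_linux64 or number is None:
        return None
    return libc.syscall(ctypes.c_long(number))


if __name__ == "__main__":
    print("wrapper:", pid_via_wrapper(), "by number:", pid_via_number())
```

A preloaded library that defines its own `getpid` would change the first result but not the second, which is why a program that issues `mseal` by number is not affected by a preloaded `mseal` symbol.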

the8472

https://lwn.net/Articles/978010/ says there'll be a glibc tunable

cataphract

Depends on whether the program calls into libc or inlines the syscalls, I imagine. Though you could use other mechanisms like seccomp.

unwind

Meta: the mseal() prototype in the article needs some editing; it is not syntactically correct as shown now. The first argument is shown as

    unsigned start addr

But should probably be

    unsigned long start_addr

hifromwork

Seems to be OK now:

    int mseal(unsigned long start, size_t len, unsigned long flags)

Iwan-Zotow

should be size_t

westurner

- "Memory Sealing "Mseal" System Call Merged for Linux 6.10" (2024) https://news.ycombinator.com/item?id=40474510#40474551 :

> How should CPython support the mseal() syscall?

xterminator

OpenBSD has had it since forever [1]. Why is such an obvious feature only reaching Linux now?

[1] https://man.openbsd.org/mimmutable.2

gilgamesh3

>OpenBSD has had it since forever.

OpenBSD introduced mimmutable in OpenBSD 7.3, which was released on 10/4/2023 (4/10/2023 in US notation, i.e. April 10, 2023), so it isn't "forever".

Meanwhile, Linux and FreeBSD have had "memfd_create" forever, while OpenBSD doesn't have anonymous files and relies on "shm_open".

pushupentry1219

> OpenBSD introduced mimmutable in OpenBSD 7.3

Correct, but they did have a very similar syscall for a long time, which they deprecated after the release of mimmutable, IIRC.
