
Use Long Options in Scripts

156 comments · March 21, 2025

wahern

Please DO NOT mix string interpolation and command execution, especially when a command is processed through the shell. Whatever your language, use a list-based or array-based execution API that passes arguments straight through to execv(2), execvp(2), etc, bypassing the shell.
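In Python, for example, passing a list to `subprocess.run` (with the default `shell=False`) gives exactly this: each element becomes one argv entry and no shell ever parses the input. A minimal sketch, using a Python child process as a stand-in for the real command:

```python
import subprocess
import sys

user_input = "'; rm -rf / #"  # hostile-looking input stays inert

# List form: each element maps to one argv entry; no shell is involved,
# so quoting characters and metacharacters in the input have no effect.
proc = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
    capture_output=True, text=True,
)
print(proc.stdout.strip())  # '; rm -rf / #
```

The child receives the hostile string verbatim as a single argument, with no escaping step anywhere.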

matklad

Was waiting for this comment :P

The API used handles string interpolation correctly: the string literal is parsed at compile time, and the interpolated arguments are never concatenated or escaped; each ends up directly as an element of the argv array passed to the child. See

https://github.com/tigerbeetle/tigerbeetle/blob/7053ecd2137a...

chubot

This approach creates an odd mini language, which is incomplete:

    comptime assert(std.mem.indexOfScalar(u8, cmd, '\'') == null); // Quoting isn't supported yet.
    comptime assert(std.mem.indexOfScalar(u8, cmd, '"') == null);
But you can do correct interpolation with simple shell variables, rather than generating shell code strings:

    $ today='foo; bar' sh -c 'argv git switch --create "release-$today" origin/main'
    ['git', 'switch', '--create', 'release-foo; bar', 'origin/main']
So that test shows we can use a plain shell string without any shell injection bug. (argv is my command to test quoting: python -c 'import sys; print(sys.argv)' "$@" )

Note that there's no escaping function needed, because we're not generating any shell code. We're generating an argv array for `/bin/sh` instead.

---

So by invoking with an env var, you can easily create a correct API that uses plain shell

    git switch --create "release-$today"
rather than

    git switch --create release-{today}  # what language is this?  It's not obvious
If you don't want to use the env var, you can also use

    git switch --create "release-$1"
And invoke with

    ['sh', '-c', shell_string, 'unused-arg0', today_string]
With this approach, you don't need

    1. any kind of shell escaping
    2. any analyzing of pseudo-shell strings, which can't contain quotes
Because you are not generating any shell code. The shell code is constant.
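The second variant translates directly to Python; a sketch of the invocation (with `echo` standing in for the real git command):

```python
import subprocess

today = "foo; bar"  # shell metacharacters, delivered purely as data

# The shell text is a constant; the value arrives as positional parameter $1.
# The extra "sh" fills $0, so today lands in $1 and no escaping is needed.
proc = subprocess.run(
    ["sh", "-c", 'echo "release-$1"', "sh", today],
    capture_output=True, text=True,
)
print(proc.stdout.strip())  # release-foo; bar
```

The `;` never reaches the shell parser as code because it was never part of the script text.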

latexr

Why would they even change the language and the commands in the example? It confuses and undermines the point. Just say “use `git switch -c my-new-branch` for interactive usage and `git switch --create my-new-branch` in scripts”. It makes no sense to introduce other unexplained information.

gray_-_wolf

Another approach is to have a language powerful enough to let you guard against shell injection. I wrote a syntax form allowing you to do this:

    (sh "cat " file " >" output)
With file being bound to "foo'bar" and output to "x", it is automatically translated into

    cat 'foo'\''bar' >'x'
This gives you the flexibility to use shell (sometimes it just is the most concise way) while being safe against injection.

I believe for example in rust you should be able to do the same.

delusional

How do you know which shell you're escaping for? You could query the system, but now you end up implementing escaping for every shell out there.

gray_-_wolf

Good question. I care only about POSIX compatible shells, so the escaping just follows the POSIX rules. In practice that means it works on any actually used system except windows, which is fine with me.

gorgoiler

Miniature, in-line sh scripts are also fine as long as you use the provided parameter substitution.

If you’re averse to this:

  q(“select x where y = ‘“ + v + “‘“)
And instead do this:

  q(“select x where y = %s”, v)
Then you should be averse to this:

  x(“foo --option ‘“ + v + “‘“)
And instead do this:

  x(‘foo --option “$1”’, v)
This is particularly useful when it’s expedient to have one thing piping into another. Like it or not the sh DSL for pipes is excellent compared to doing things natively with execve() and pipe(), just as doing group by and count is far more concise in SQL than doing so natively.

Most SQL libraries give you something like q. Writing your own x is as simple as calling sh correctly. In Python, for example:

  def x(script, *args):
    run([“sh”, “-c”, script, “--“, *args])
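Rewritten with plain ASCII quotes, and extended to return the completed process so output can be checked, the same helper composes with pipes while the arguments still travel as data rather than code — a sketch:

```python
from subprocess import run

def x(script, *args):
    # Constant script text; values arrive as "$1", "$2", ... ("--" fills $0).
    return run(["sh", "-c", script, "--", *args],
               capture_output=True, text=True)

# A pipe plus a hostile-looking value, passed safely:
out = x('echo "$1" | tr a-z A-Z', "hello; $world").stdout
print(out.strip())  # HELLO; $WORLD
```

The `$world` in the argument is never expanded, because parameter expansion of `"$1"` does not re-parse its contents as shell code.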

kbenson

Neither of those is equivalent to variable binding, which is what most SQL libraries provide, specifically because they don't actually solve the problem: they're still doing string substitution. Putting a double quote in $1 in your "good" example will let you break out of what's expected, and then you're Bobby Tables.

Your Python example at the bottom is more correct, in that it allows each arg to be passed as a separate element, so there's no way to break out through quoting characters. SQL binds are like that in most libraries, even if they don't look like it. The parser knows it's a single item and passes it along as such. You cannot escape it in the same way.

gorgoiler

I don’t really follow. My “good” example and the code at the bottom are the same.

sh is smarter than just doing string interpolation and ”$1” is passed on as a single argument, no matter what:

  > run(["sh", "-c", 'echo "$1"', "--", 'a"'])
  a”
Whereas if it were simple string interpolation, you’d see this:

  > run(["sh", "-c", 'echo "a""')
  --: 1: Syntax error: Unterminated quoted string
It’s the same special casing that gets "$@" right.

pwdisswordfishz

    SyntaxError: invalid character '“' (U+201C)

crazygringo

For anything involving file paths, user input, etc. -- yes, of course. It's not even a question, because otherwise they would need to be escaped, which nobody wants to do.

But for a simple example like this where it's inserting a date which has known properties, it seems fine, and is much more readable.

paulddraper

Tbf this input does not need escaping.

But at the very least the shell is unnecessary here.

tasuki

Why not?

bulatb

Any time you send commands and data down a single channel, user input that's intended to be data can be misinterpreted as a command. For example, if your program wants to:

    run("program --option '{user_input}' > file")
to save some input to a file, and the user's input is:

    '; bad_command #
then when run() sends that string to the shell, the shell will run:

    program --option '';
    bad_command #' > file
Most languages have something like a safe_exec() that separates the shape of the command from the values of the options, executing "program" with the options and the user_input in the arguments array as data. Skipping the shell step, which would just be building an exec call anyway, removes the opportunity for users to confuse it into doing something else.

The list-based API alternative they recommend might look like this:

    safe_exec(["program", "--option", user_input], stdout="file")
and it would always exec "program" with argv[1] == "--option" && argv[2] == user_input. If the user_input happens to be:

    '; bad_command #
...well, then, the user can enjoy the contents of their file.
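`safe_exec` is a made-up name, but the pattern maps directly onto, say, Python's `subprocess`: a list argv plus a redirected stdout, with no shell anywhere in the path. A sketch (a Python child stands in for the hypothetical `program`):

```python
import os
import subprocess
import sys
import tempfile

user_input = "'; bad_command #"

# List argv plus a file handle: the shape of safe_exec(..., stdout="file").
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "w") as out:
    subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[2])",
         "--option", user_input],
        stdout=out,
    )
with open(path) as f:
    print(f.read().strip())  # '; bad_command #
```

The hostile string lands in the file as plain data; `bad_command` is never interpreted.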

tasuki

Yes of course. But why would you expect me to run shell commands with random person's input? Also:

    safe_exec(["rm", user_input])
This isn't safe either! Despite clearly saying "safe_exec"!

null

[deleted]

echelon

SQL injection on steroids.

rat87

Only if you are getting input from untrusted users

remus

imo it's best to just avoid it altogether. Requirements change, and what was once a trusted input can become untrusted input.

brookst

Or if your trusted users are fallible and could be tricked into providing unsafe inputs.

susam

I prefer long options too. However, while writing programs that need to invoke POSIX commands in a portable manner, short options are the only viable choice, as POSIX doesn't specify long options. For instance, see the specification for diff at <https://pubs.opengroup.org/onlinepubs/9799919799/utilities/d...>, or that of any POSIX utility listed at <https://pubs.opengroup.org/onlinepubs/9799919799/idx/utiliti...>.

That said, this is more of a corner case. In most scenarios, rather than relying on POSIX utilities, there are often better alternatives, such as using library bindings instead of spawning external processes. For example, instead of invoking grep, using something like libpcre could be a more efficient choice.

For non-POSIX utilities like git, hg, rg, ag, etc., using long options makes perfect sense.

Wowfunhappy

> However, while writing programs that need to invoke POSIX commands in a portable manner

...probably a stupid question, but something I have earnestly been wondering about... when does this actually happen nowadays? What POSIX systems are you targeting that aren't one of the major ones (Linux, Darwin, or one of the major BSDs)?

I was writing a shell script a few months ago that I wanted to be very durable, and I targeted sh instead of bash just because, well, it seemed like the correct hacker spirit thing to do... but I don't actually know what system in the past decade (or more) wouldn't have bash.

chrisweekly

I ~recently had to wrestle w/ TeamCity (CICD) whose build agents provide only sh. I needed to invoke a 3rd-party util that required bash. The resulting "bash in dash in docker in docker" worked, but I wasn't thrilled about the convoluted / Frankenstein setup.

em500

> ... but I don't actually know what system in the past decade (or more) wouldn't have bash.

There's some ambiguity about "have bash". If "having" bash means that (some version of) bash has been ported to the system, there are indeed very few. If "having" means that bash (supporting all options that you need) is available to the user, that could be a lot more. As others have noted, the BSDs, Android and many embedded Linux systems don't come with bash pre-installed, MacOS pre-installed bash is stuck at version 3.2 (which doesn't have associative arrays), and the user could be in an environment that does not allow them to install whatever they need.

mceachen

Alpine docker images only come with dash instead of bash, which _may_ run your sh script, but test thoroughly. Or just install bash.

FWIW, Darwin/macOS is especially guilty of gobsmackingly ancient coreutils that don’t support long option variants.

samatman

Is it? I'm with you on gobsmackingly ancient, but it's the "doesn't support long options" part that I haven't bumped into. I do replace some coreutils, but not all of them.

What's a good example of such a utility?

dfe

macOS doesn't have GNU coreutils at all. It has the utils from FreeBSD.

The gobsmackingly ancient GNU software it does have is bash, because it's the last version under GPL 2. I've used Mac OS X since 10.1, so I remember when the default shell was tcsh and /bin/sh was not bash.

That's (basically) the case again on the last few macOS releases. Today, zsh is my shell of choice, including on Linux.

pingiun

The Alpine default shell is called "ash"; "dash" is the Debian/Ubuntu default shell.

hulitu

P in POSIX stands for portability. /s

cmgbhm

Where I can sometimes get burnt is busybox.

I more often get burnt going from zsh to bash than by that, however.

susam

> I don't actually know what system in the past decade (or more) wouldn't have bash.

I have written a bit more about it in these comments:

https://news.ycombinator.com/item?id=40681382

https://news.ycombinator.com/item?id=17074163

Gud

FreeBSD doesn’t come with bash though.

vermaden

But it also has drawbacks :)

But being honest - you can install BASH this way:

# pkg install -y bash

justaj

Nor does OpenBSD for that matter.

vsl

Major ones are enough. Linux and Darwin (that is, macOS and GNU userspace, really) differ sufficiently that you need to pay attention or limit yourself to POSIX. E.g. sed and wc burned me a few times with scripts that need to run on both.

Someone

> but I don't actually know what system in the past decade (or more) wouldn't have bash.

I think macOS still has bash, so technically it doesn't count, but it doesn't have a bash from the past decade, and it uses zsh by default.

theamk

Note that grep in particular is extremely optimized. If you have multi-gigabyte files and you only search for one thing, shelling out to grep will likely have much better performance than doing it yourself.

But not every system needs that much, and in a lot of cases, using your language's regexp library will be more robust and easier to write.
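For the shelling-out case, a sketch of a hypothetical helper (assuming GNU grep, since the long options below aren't POSIX — exactly the portability caveat raised elsewhere in the thread):

```python
import subprocess

def files_matching(pattern, paths):
    # grep's scan loop is heavily optimized; a list argv keeps the call safe,
    # and "--" stops option parsing before the user-supplied pattern.
    proc = subprocess.run(
        ["grep", "--ignore-case", "--files-with-matches", "--", pattern, *paths],
        capture_output=True, text=True,
    )
    return proc.stdout.splitlines()
```

Note that grep exits non-zero when nothing matches, so a production version would distinguish return code 1 (no match) from real errors.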

dosourcenotcode

Agree that long options should be used. But there is one caveat to consider: portability.

Sadly to this day not all BSD distributions have GNU style long options. And the ones that now do only got them fairly recently. So if you want portability you have to use short options as you weep with a bottle of vodka in hand.

mplanchard

Not trying to spam this thread with praises of nix, because it does have its own problems, but it certainly solves the portability problem.

Four years in to using it at work for dev environments across mac (x86 & ARM) and various linuxes and can’t imagine going back. I also always make dev environment definitions for my open source projects, so even if people aren’t using nix, there is at least a record of what tools they will need to install to run scripts, tests, etc.

nine_k

Does nix work well on BSD-derived Unices? In particular, the most widespread of them, macOS?

mplanchard

Yes, works great on Mac. About half our engineers use Macs, the other half Linux. We have one nix configuration for the dev environment, which works for everyone.

saghm

This surprises me, because the first case I remember ever coming across where short versus long options impacted portability across GNU and BSD was _fixed_ by using long options.

Maybe six years ago I had an issue porting a script someone else had written for use in CI that happened to decode some base64 data, and it failed when I tried to use it on a different platform. The issue boiled down to the macOS version of base64 and the Linux one capitalizing the short option for decode differently: one used `-d` and the other used `-D` (although I can't remember which used which, honestly).

My solution was to use the long option `--decode`, which was the same on both, and since then I've always used the long option out of habit. That probably explains why I can't remember which short option Linux uses, despite it being the one I've used far more in the years since.

delusional

I think the right way to think about this (if your goal is to avoid surprises at least) is that options (short or long) are just strings. There's no guarantee that there's a long variant of an option. There's not even a requirement that options start with a dash. A sufficiently brain-damaged developer could start them with a slash or something.

If you're going for portability the best bet is to just read the manual for each of the separate versions and do whatever works.

sgarland

To this day, I write tar options with no dash, simply because I can. `tar cvzf foo.tar.gz ./foo`

I would never write a new program with this option, but I do find it a delightful historical oddity.

saghm

That sounds reasonable to me. If anything, I might even go further and say that reading the manuals wouldn't be enough to fully convince me without also actually testing it by running a script on a given platform. It's not that I don't trust the manuals to be right, but I have less trust in myself to write bug-free code than probably any other language I've ever used, and I don't think I'd feel confident without verifying that I actually did what the manual said correctly.

teddyh

Also, do not forget using “--” after all options, but before any dynamic arguments, just to be safe.

arcanemachiner

I know to do this intuitively, but I have no idea why.

hoherd

It terminates argument parsing, so anything following it that starts with a hyphen will not be treated as an argument.

    $ echo 'hack the planet' > --help
    $ cat --help
    cat: illegal option -- -
    usage: cat [-belnstuv] [file ...]
    $ cat -- --help
    hack the planet
    $ rm -vf --help
    rm: illegal option -- -
    usage: rm [-f | -i] [-dIPRrvWx] file ...
           unlink [--] file
    $ rm -vf -- --help
    --help
    $ cat -- --help
    cat: --help: No such file or directory

hiAndrewQuinn

My eyes have been opened. Thank you!

less_less

It tells the shell utility that any remaining arguments are not options, but instead files or whatever the script might process. You know, in case someone makes a file called -rf.

pletnes

But not all shell utilities follow this particular convention.

bluedino

It's worth it just to watch the frustration of a junior when they try tacking more arguments on the end of a command.

account42

A great opportunity to teach the importance of reading the whole command before trying to modify it.

saagarjha

Unfortunately, if you want your scripts to be portable to other POSIX systems you might have to use the short options, as the long ones are not standardized. You have to decide the tradeoff for yourself.

mplanchard

Using nix has really spoiled me on this. Everyone gets the same versions of all the CLI utilities in the dev environment, whether on mac or linux, and those are the same versions that run in CI and any prod systems. It’s really nice being able to use whichever newer bash features or gawk extensions you like, without having to deal with trying to ensuring the mac engineers have brew-installed all the right stuff to match a standard linux env.

delusional

nix didn't solve your issue here. nix didn't do anything. You're just describing the benefit of a reproducible development environment. You could do the same thing with brew, pacman, apt, or by just compiling every package from source from some huge mirror.

It's exactly the same thing people initially loved about docker or vagrant.

mplanchard

Sure, but it works on Mac and Linux and doesn’t require virtualization. I think brew might qualify, but it can’t define which environment variables should be available in the developer shell or which hooks to run upon entry.

I don’t think any of the other options you specified can manage the same thing.

paulddraper

Everyone has to use nix :)

But yes, that is nice.

mplanchard

That is the caveat. I initially set it up such that it wasn’t required: you could choose to use it if you wanted to, and otherwise here is a list of specific versions things you must install, etc. Everyone ultimately chose to use nix, and now it’s required. Makes for a pretty easy setup though for new devs: install nix, then run `nix develop`, then `make setup`, and you’re off to the races.

pcwalton

What POSIX systems in actual use (not historical Unixes) don't have the long options? macOS' BSD utilities I guess?

yjftsjthsd-h

> What POSIX systems in actual use (not historical Unixes) don't have the long options?

All of them except for GNU, AFAICT? (That is, only GNU seems to have long options.) Checking manpages for rm(1) as a simple reference, I can't see long options in any of the 3 major BSDs or illumos, and checking Alpine Linux seems to show busybox also only doing short options (sorry, can't find an online doc for this, though it's easy to check in docker if you don't have a machine running Alpine handy). OpenWRT also uses busybox and has the same (lack of) options.

https://man.netbsd.org/rm.1

https://man.openbsd.org/rm.1

https://man.freebsd.org/cgi/man.cgi?query=rm&apropos=0&sekti...

https://illumos.org/man/1/rm

SoftTalker

More than this, the gnu utilities often have options that don't exist at all on other platforms, in either long or short form.

saagarjha

Yes or like BusyBox

schneems

You can also brew install tools like gnused which have the same arguments. Not a viable option for all situations but if you just need to execute it on Linux and your local machine for dev you can use those.

null

[deleted]

ratrocket

I agree with this practice. Another benefit is it makes it easier (slightly, but still) to grep the man page for what the options do.

The corollary must be "write programs that take long options".

starkparker

And put them on separate lines so you can track and git blame them more easily.

ivanjermakov

Same line git blame is not that hard, just list commits affecting specific file or even specific line span: https://git-scm.com/docs/git-log#Documentation/git-log.txt--...

amelius

Before invoking a command, always first check if the length of the command is not longer than ARG_MAX. For example, if this is your command:

    grep --ignore-case --files-with-matches -- "hello" *.c
Then invoke it as follows:

    CMD="grep --ignore-case --files-with-matches -- \"hello\" *.c"
    ARG_MAX=$(getconf ARG_MAX)
    CMD_LEN=${#CMD}

    if (( CMD_LEN > ARG_MAX )); then
        echo "Error: Command length ($CMD_LEN) exceeds ARG_MAX ($ARG_MAX)." >&2
        exit 1
    fi

    eval "$CMD" # warning, evaluates filenames

mhitza

That might be sensible, but it also obscures the script logic.

Since switching to Linux exclusively, I don't think I've ever encountered an issue due to too many arguments or an over-long command. And it's the first time I'm actively searching online for ARG_MAX.

I understand that different shells might differ, but with reasonable lengths, is there any chance of it being relevant (aside from xargs, where it's generally intended, or better, to pass along each argument individually)?

amelius

I started running into these issues when I started working with training examples in the context of deep learning. A folder with millions of files is then not unheard of.

Also if you do things like */*/*, then you can quickly get large command lines. Or even if you do long_name/another_long_name/*.

wodenokoto

I've had problems with "cat *.csv" plenty of times when processing data that is generated in many small files.

It is really difficult to deal with, because on top of the ARG_MAX limit, globs are not guaranteed to be in order.

The solution is not obvious, hard to get to if you don't know the footguns in advance, and hard to read once implemented.

apgwoz

You should always type check your shell scripts as well. For example, you just:

    $ shelltypes script.sh
    # Welcome to shelltypes v 3.23.2
    # type ‘help’ if you’re stuck
    >>> {1) # import POSIX;;
    Importing 73 items.
    >>> {2} # append loadpath “/opt/local/shelltypes/base”;;
    >>> {3} # import base::YOURPROJECT;;
    Importing 15 items.
    >>> {4} # check “YOURSCRIPT.sh”
    Parsing YOURSCRIPT.sh.
    Reticulating splines.
    Expanding aliases.
    Analyzing free shell environment variables.
    Found inconsistencies in PATH.
    Warning: Low battery!!!
    Warning: found free type for ‘shred’, ignoring.
    Warning: use of sudo requires password under /etc/sudoers.
    Warning: this utility is fake.
    Error: use of cat impossible in the presence of mutt.
    Found 15 errors.
    Try again. Goodbye.
    $

Then you can be pretty sure your script isn’t going to do unnecessary harm, and has some proper guardrails in place.

Timon3

Where does "shelltypes" come from? I can't find anything on DuckDuckGo or Google, but this seems like it would be very useful.

Arrowmaster

From the output in the post I'm going to assume it's either a joke post or an LLM hallucination.

null

[deleted]

amelius

I don't know shelltypes, but it sounds like a linter for shell scripts.

Does shelltypes warn against a failure to check for ARG_MAX?

p_wood

> eval "$CMD"

That means you will eval all the filenames, so if you have a file with spaces in its name, it will appear as two files; if there is a `$` in the name, it will trigger parameter substitution; and so on for the other shell metacharacters.

amelius

Yes, that could be true. I'm not great at Bash. Be careful. These types of errors are why I don't use Bash. I just wanted to give an example in a commonly used scripting language. The main point here is to check ARG_MAX.

account42

If you are doing something where going over ARG_MAX is a real possibility, it would be better to write your scripts to avoid the problem altogether rather than awkwardly trying to detect it, with a bonus exploit. For example, many commands can accept a list of files on standard input or from a list file.
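One way to sketch that standard-input approach in Python (assuming GNU grep and GNU xargs, whose `--null` is the long form of `-0`; `grep_many` is a hypothetical helper):

```python
import subprocess

def grep_many(pattern, paths):
    # Stream NUL-separated file names over stdin instead of putting a huge
    # glob expansion on the command line; xargs splits them and re-invokes
    # grep in batches that fit under ARG_MAX.
    proc = subprocess.run(
        ["xargs", "--null", "grep", "--files-with-matches", "--", pattern],
        input="\0".join(paths) + "\0", capture_output=True, text=True,
    )
    return proc.stdout.splitlines()
```

NUL separators mean filenames with spaces or newlines survive intact, which a plain newline-separated pipe would not guarantee.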

_huayra_

On my system, `getconf ARG_MAX` is over 2M.

I have seen some heinously long command lines, but nothing close to that. Usually when invocations have crept up to O(1k) characters at places I've worked, I've implemented a "yaml as cmdline args" option to just pass a config file instead.

Have you seen scenarios where this is actually limiting?

amelius

Yes, if *.c expands to a string over 2m. Maybe that is a lot for .c files, but it may easily happen with .tiff and a folder full of images used for training a deep learning model, for example.

_huayra_

Thanks, this is interesting. I have done a lot of this sorta stuff (glob expanding giant directories) without a thought for this `ARG_MAX` libc parameter, but now I know I need to keep it in mind!

hulitu

> Before invoking a command, always first check if the length of the command is not longer than ARG_MAX.

Tell that to Google and Mozilla. /s

croes

> Long form options are much more self-explanatory for the reader.

And less prone to typos

dosourcenotcode

And not just options but base command names too. I wrote a tool to partially mitigate this in some cases: https://github.com/makesourcenotcode/name-safe-in-bash

sebastianmestre

You have a nice ternary counter going in the version numbers :)

ndegruchy

This is one of my default rules for writing scripts. If the long option is available, use it. It makes too much sense to do so.

jFriedensreich

The hype side of this is to always add an instruction to use long options to the agent's base prompt. It's much easier to spot mistakes, and long options have the advantage of not doing something completely different when they are wrong.