Created: 2016-01-23
Some complementary material on interpreter directives and command name issues
can be found in Wikipedia. You can also read the email in which
Dennis Ritchie introduced #!.
ABSTRACT
The command name for any Unix script must be stable for
any complex system based on it to be stable.
However, this is being
compromised through practices based on misinformation.
This paper explores how scripts are actually run,
how naming affects correctness and stability,
and various common misconceptions
in order to clarify the reasons behind standard practice
- which is: Command names should never have filename extensions.
Command name extensions have numerous issues:
- They unnecessarily expose implementation detail
(breaking encapsulation).
- They uselessly and incompletely mimic detail from the #! line.
- They capture insufficient detail to be useful at the system level
(and aren't used).
- They clash with recommended Unix (and Linux) practice.
- They add noise to the command-level API.
- They are very commonly technically incorrect for the script.
- They give incorrect impressions about the use of the files they adorn.
- They aren't validated even for what little info is present in them.
- They interfere with switching scripting languages.
- They interfere with changing scripting language versions.
- They interfere with changing to presumably-faster compiled forms.
- They encourage naïvely running scripts with the extension-implied
interpreter.
- They infect novice scripters with misinformation about Unix scripting.
Ironically all of these are problems involving interpretation by humans.
Herein, a problem with filename extensions is described in a manner
perhaps more pragmatic than, yet inspired by, the well known
Go To Statement Considered Harmful
by Edsger W. Dijkstra
(Communications of the ACM, Vol. 11, No. 3, March 1968).
Dijkstra's work addresses the issue of how the use of the go
to statement largely abridges the ability to parametrically
describe the progress of a process, engendering an unnecessary
impediment to the code's clarity and manageability.
This new document details, based on practical experience under Unix-like
operating systems, how filename extensions, particularly but not
limited to those on files implementing commands, create a secondary set of
semantic tags in the interfaces between programs which are
demonstrably both superfluous and treacherous.
It's not a coincidence that in both Dijkstra's plaint and this one,
computers are not at all affected by either practice - it's entirely a
problem for just the humans.
What is a Command
For purposes of this paper, command names are the filenames of all the
executable files in the directories listed in the Unix $PATH environment variable.
By convention, almost all such directories end in bin
(nominally suggestive of a tool bin, and not restricted to binaries),
with sbin (system bin) and games also being common.
Historically etc or lib occasionally appeared in $PATH, but this has become increasingly rare since the 1980s.
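As a quick illustration (a hypothetical but typical session - directories and
results will vary by system), the command names visible to a user are just the
executable filenames found along $PATH:
$ echo "$PATH"
/usr/local/bin:/usr/bin:/bin:/usr/games
$ command -v lesspipe      # which executable would actually run
/bin/lesspipe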
Didactic Examples
Consider the following examples, in which files have
.sh and .py
extensions, ostensibly to indicate the type of each file as well as to
make it easy to list all files of the same type (e.g. all shell scripts).
Running them based on the apparently-correct interpreter doesn't go well
(the […] means truncated for brevity):
$ sh frob.sh
frob.sh: 2: Syntax error: […]
$ ./frob.sh
hello world
$ cat frob.sh
#!/usr/bin/perl -w
# Used to be a shell script,
# but we couldn't change the name...
use strict;
printf("hello world\n");
$
$ python knob.py
File "knob.py", line 2
func = print
^
SyntaxError: invalid syntax
$ ./knob.py
hello world
$ cat knob.py
#!/usr/bin/python3
func = print
func("hello world")
$
$ sh qux.sh
qux.sh: 3: qux.sh: Syntax error: […]
$ ./qux.sh
hello world
$ cat qux.sh
#!/bin/bash
cat <<<'hello world'
$
These scripts show some problems around trusting extensions:
- The first example (frob.sh) is typical when the code has been later
reimplemented (here from sh to perl) but the
filename was left unaltered for backwards compatibility.
- The second example (knob.py), which can occur either the same way as the
one just mentioned or through a number of other reasons (python3 was
used where python2 was still the system default, etc.), shows
how a commandname extension
doesn't capture enough information to be useful, but instead breeds
false confidence.
- The third example (qux.sh) shows that extensions mislead even among shell
scripts: Bourne shell syntax is only a subset of Bourne-Again syntax,
yet people often use .sh where only .bash
would have made any sense, and even then only for .-ables
(function libraries and support scripts, not programs).
Failures aren't always as obvious as immediately exiting with
an error: more subtle differences in script language behavior, or a
script with enough error trapping to survive being run with the
wrong interpreter version, can produce incorrect results and
serious damage.
$ python divide.py 5 2
2
$ ./divide.py 5 2
2.5
$ cat divide.py
#!/usr/bin/python3
import sys
a, b = int(sys.argv[1]), int(sys.argv[2])
print(a / b)
$
(The same issue can arise through command search in $PATH
finding a different version of a program than expected,
especially when using virtual environments,
but that's outside the scope of this document.)
Methods of Specifying Interpreters for Scripts
Several mechanisms exist to determine how a file should be executed,
whether as a set of directives or as machine code. The ones relating
to this discussion are:
-
A process can let the kernel (in the exec call) use the
magic number
found at the start of a file, usually followed by
additional data, to determine details of the file's executable format.
Magic numbers are usually found at the beginning of files, encoding
whether a file contains a compiled program,
dynamically-loadable code,
or other data in a known binary format, such as
images, audio data, and so forth.
Typical examples are compiled programs,
which generally lack the extensions that were attached to the source
files from which they were compiled.
-
A process can elect to treat a file specially based upon a
file extension, which provides a small
subset of the magic number's functions, but allows the data within the
file to lack any kind of recognizable in-band header. In most
common Unix filesystems a filename isn't exactly file metadata, but
instead part of each directory entry linking to the file, and can easily
be set to disagree with the file's internal data. Unix also leaves
extension handling as a userspace function which varies widely between
different applications, has no direct kernel support, and is most
common for file types like audio, images, video, compiled programming
language files, etc.
-
A process can apply an arbitrary
filetype association
by running a file through some other program, as when a
shell user runs some interpreter program by name with an explicitly
specified data source, such as using sh explicitly to read
in and run a shell script. With most interpreters, this will cause
both filename extensions and magic numbers to be ignored. Automatic
dispatch by extension is NOT something Unix itself does, but many computer
users from other OSs (notably Windows) default to acting as though it is.
-
A process can assume an executable file with no magic number is
in the process's own language by default.
The Bourne shell, by virtue of having been the dominant early
Unix scripting language, is usually assumed as the default.
-
One particular magic number, 0x2321 (in big-endian), which
renders as #! in ASCII and is interpreted as a comment
in most scripting languages, tells the kernel to execute a specific,
given program with the file as an argument.
The first ~16 to ~80 bytes after the #! up to a newline
(the length limit is different
on different flavors/eras of Unix) specify the program and any
arguments.
This
interpreter directive
(one of several names with no clear winner)
is combined with the pathname of the file (or some other way to open it)
and the named interpreter is started as a new process.
The result is nearly identical to a user having simply entered the
interpreter and filename on the command line,
except that the user is freed from needing to know the program
and options required,
and thereby also freed of needing to know of the interpreter changes.
Some examples of interpreter directives include:
- #!/bin/sh
- #!/bin/bash -ex
- #!/usr/bin/perl -w
As well as less common applications:
- #!/usr/bin/less (on a README file, mode 755)
- #!/usr/bin/env perl (to apply PATH to finding perl)
- #!/usr/bin/gnuplot (followed by plotting commands)
- #!/bin/echo file: (try this to see the script's filename passed through)
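Both the generic magic-number mechanism and the #! case can be inspected with
ordinary tools. A brief sketch, reusing the frob.sh file from the earlier
examples and assuming GNU od (output abridged and approximate):
$ od -An -c -N 4 /bin/ls      # a compiled program: the ELF magic number
 177   E   L   F
$ od -An -c -N 2 ./frob.sh    # a script: 0x23 0x21, which renders as "#!"
   #   !
$ head -1 ./frob.sh           # the rest of the interpreter directive
#!/usr/bin/perl -w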
Interpreter Directives are an Intrinsic Part of File Content
Interpreter directives can only be changed by modifying the files'
contents, whereas file extensions can be changed arbitrarily using
general filesystem commands like mv. File extensions also
have a disturbing tendency to get lost in some contexts, since they're
part of a Unix directory entry, not part of the file itself.
In contrast, interpreter directives are quite stable. With scripts,
interpreter directives are typically changed in the same manner as the
other contents through using a text editor. Modern editors can usually
recognize scripts by their interpreter directive, although historically
special handling of certain types of text files was usually done based
on the file extension.
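To make the contrast concrete (reusing the earlier frob.sh), a rename silently
changes what the extension claims, while the directive inside the file stays
accurate until someone deliberately edits it:
$ mv frob.sh frob.py      # the name's claimed "type" changes with one directory operation
$ head -1 frob.py         # ...but the content, and thus the real interpreter, does not
#!/usr/bin/perl -w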
Humans are the Problem
Now, so far, command name extensions might look like no more than hints to
editors to use the correct editing mode, or to humans to make it easy to
ls by script type. The kernel doesn't view them specially at
all - they're just more bytes in the filename. But there
is an insidious problem with them, in that using them breaks part of
the mechanism by which the implementation details are hidden from the
user, and from other programs written by users. It's the humans'
attempt to apply the information in these command name extensions
that causes problems.
Effects of Porting Programs Between Languages
Programs in Unix often start their lives as quickly written,
inefficient, under-featured shell scripts. Later, they get converted to
something faster, like Perl or Python. Finally, they are often
rewritten in C, C++, or something else fully compiled. If the author
violates encapsulation by exposing the underlying language in a
spurious extension, the command name may change from
a name.sh, to
name.pl,
to name, breaking all existing coded
calls to the program each time, as well as adding to the cognitive
load of human users. The more effective the user base has been at
script-based factoring and reuse, the more treacherous the extensions
become (i.e. proficient users often build more readily on preëxisting
programs, increasing the number of dependencies on the names of those
programs).
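A tiny sketch of that breakage (the names summarize and nightly-report are
invented for illustration): any caller that hard-codes the extension breaks at
each rewrite, while callers of the extensionless name never notice:
#!/bin/sh
# nightly-report (hypothetical caller)
summarize.sh /var/log/app.log   # breaks the day summarize.sh is rewritten as summarize.pl
summarize /var/log/app.log      # survives sh -> perl -> C rewrites unchanged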
To combat the problem of breaking dependencies, what usually happens is
that when the name.sh script ends up being
rewritten in (for example) Perl, the now-misleading old name is
retained to keep from breaking other programs which refer to it. The
resulting mismatch causes extra maintenance hassles principally to
users trying to maintain the extensions, who naïvely type things like
ls -l *.sh without realizing some of the listed files
aren't shell scripts anymore. Such semantic dissonance leads
easily to more serious issues, with scripts called by the wrong
interpreters in error-suppressed contexts, truncated processing due to
the resulting errors, and the resulting arbitrarily disastrous problems.
Command Name Extensions Are Often Wrong - and Subtly
The issue of using the wrong interpreter can be subtle, since a user
seeing a name.py program may enter
python name.py, not realizing that the program
only works with python 2.5 when 2.4 is still the system default
(the former would have a directive like #!/usr/bin/python2.5).
Most scripts suffixed with .sh on Linux are actually bash
scripts, and many versions of Unix don't include bash, just
the real Bourne /bin/sh (no arrays, no $(…), no <(…), etc).
Scripts also often make delicate use of interpreter directives to
have the PATH used or ignored, or special options passed in,
none of which is capturable in a primitive filename extension.
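For instance, none of the following distinctions (illustrative directives only)
survive being flattened into a .py or .pl suffix:
#!/usr/bin/python2.5        # an exact interpreter version is required
#!/usr/bin/env python3      # python3, but located via the caller's $PATH
#!/usr/bin/perl -T -w       # taint checking and warnings must be enabled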
Some Command Name Extensions Matter - But Not To the Kernel
There are cases where scripts are executed as a result of special
extensions, such as the model currently used by most webservers where
file handling is cued by filename extensions. However, even such
subsystems often have other, more sophisticated approaches allowing
those same extensions to be hidden, and thus protect URIs from a
variant of the script filename extension problem, namely, how to keep
all links to your website from breaking when you switch
from *.html files to *.cgi, *.php, or
something else. Furthermore, of the extensions just listed, note that
.html files aren't scripts, .php files use a
webserver builtin, and .cgi scripts themselves require
interpreter directives to be executed correctly, in addition to
the .cgi needed for Apache to permit the script to be run.
Commands should never have filename extensions.
Rely on interpreter directives instead or some other paradigm that
prevents the implementation from being exposed, or worse yet, lied
about, within the very name of the command. The best place for
interpreter information is in the file itself, though as noted there are
some issues to deal with via #!/usr/bin/env and other tactics.
Appendices
Python
So you have this file named foo.py...
If foo is a library with a unittest activated by being run
with python foo.py or even as just ./foo.py, that's
okay - that's not a command that would live in $PATH.
However, if it's a full program with no library aspirations,
the .py is clearly wrong. I've almost never seen
a foo.py in the $PATH, since such hacks usually end
up littering top-level Flask directories or being placed on the
$PYTHONPATH (for Python libraries) instead.
There's a case where, in some bin/ directory, there are both
a foo.py implementing a library,
and a foo implementing the options parsing and using library.
In this situation the foo is executable and
the foo.py isn't, and because the extension stays on the
non-command library file, this situation is fine
(though rare).
As an example, here's a
library hellolib.py and a program hi just as described
above (save for the names):
# hellolib.py library
import unittest

def hi(whom=None):
    return 'Hello' + ((' ' + whom) if whom else ' World')

class TestLib(unittest.TestCase):
    def test_hi(self):
        self.assertEqual('Hello World', hi())
        self.assertEqual('Hello You', hi('You'))
#!/usr/bin/python
# hi program: "chmod 755 hi" so you can run ./hi
import hellolib, sys

someone = (sys.argv[1] if len(sys.argv) > 1 else None)
print(hellolib.hi(someone))
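A quick illustrative run (output approximate, with nosetests as the test
runner used throughout this appendix):
$ chmod 755 hi
$ ./hi
Hello World
$ ./hi You
Hello You
$ nosetests hellolib
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK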
There's no point in being able to
run ./hellolib.py or
python hellolib.py, because we're
obviously just going to run nosetests hellolib anyway, as per
standard practice. Otherwise, we'd have to add the rather ugly,
though accepted, lines below:
if __name__ == '__main__':
    unittest.main()
...which is a bit nasty, since we'd have to either add execute permission
on the library file too, as well as a #! line, or
guess at which version of Python is needed to run it manually,
e.g. python hellolib.py. Also, enabling execute permission makes
nosetests's decision of whether it's safe to import the file (without
causing side effects) much harder, so it doesn't test executable
files by default, and we risk the unittest in our library being skipped.
Listing All Files of the Same Script Type (sh, py, etc)
The issue of users wanting to be able to list, for example, all Bourne
shell scripts easily with ls(1) is a big motivator for some people to
name them all with .sh extensions. If ls had an option to
filter based on the execution method of a file, say something
like ls -e '*/sh' to list only files with /sh at
the end of the first part of the interpreter directive, that would
help. However, whether ls should even be doing such a job would
probably be hotly, justifiably contested.
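In the meantime, a rough one-liner (a sketch assuming GNU find and grep) gets
most of the way there, though the dedicated wrapper shown below is friendlier:
$ find . -maxdepth 1 -type f -perm -u+x \
    -exec sh -c 'head -1 "$1" | grep -q "^#!.*/sh" && printf "%s\n" "$1"' _ {} \;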
Here's an example of using a new program to address this problem:
$ cd /bin
$ scripts /bin/sh | wc -l
10
$ scripts /bin/sh
./bzgrep
./bzmore
./running-in-container
./setupcon
./unicode_start
./lesspipe
./red
./bzdiff
./bzexe
./which
$ less $(scripts /bin/sh)
$ head -1 $(scripts)
==> ./zforce <==
#!/bin/bash
==> ./bzgrep <==
#!/bin/sh
==> ./bzmore <==
#!/bin/sh
==> ./gunzip <==
#!/bin/bash
[…]
$ cd /usr/bin
$ head -1 $(scripts) | grep '#!' | sort | uniq -c | sort -nr
191 #!/bin/sh
104 #!/usr/bin/perl -w
102 #!/usr/bin/perl
58 #! /usr/bin/python
53 #! /bin/sh
7 #!/usr/bin/ruby1.9.1
3 #!/usr/bin/fontforge -lang=ff
2 #!/usr/bin/pypy
1 #!/usr/bin/env nickle
A sample script implementing the command follows (obviously with no
extension, in case someone wants to rewrite it in Python, Ruby, C, etc.).
Note that it needs #!/bin/bash specifically, since a classic Bourne
/bin/sh doesn't support $(...) or local.
#!/bin/bash
# Return a list of scripts having a given string in the interpreter directive.
Syntax () {
local regexp="$1"
echo "Syntax: $0 [<regexp> [<file>|<dir>]...]"
echo " $0 {-h | --help}"
echo ' <regexp> - used to match interpreter directives'
echo ' <file> - report file if <regexp> matches'
echo ' <dir> - report each file in <dir> for which <regexp> matches'
echo 'If no <file> or <dir> is given, "." is used as a default.'
echo 'Give a <regexp> of "." to use the default ('"$regexp"') with <file>/<dir>.'
echo 'NOTE: Only executable files are considered.'
}
ScanFile () { [ -x "$1" ] && head -1 "$1" | egrep -qs -- "$2" ; }
ScanStuff () {
local found=false
local regexp="$1" ; shift
local thing dir file
for thing in "$@" ; do
if [ -d "$thing" ] ; then
dir="$thing"
for file in $(find "$dir" -name . -o -type d -prune -o -type f -print) ; do
ScanFile "$file" "$regexp" && echo "$file" && found=true
done
else
file="$thing"
ScanFile "$file" "$regexp" && echo "$file" && found=true
fi
done
$found
}
Main () {
local regexp='^#!'
case "$1" in
--) shift ;;
-h|--help) Syntax "$regexp" ; exit 0 ;;
-*) Syntax "$regexp" 1>&2 ; exit 1 ;;
esac
[ $# -ge 1 ] && { regexp="$1" ; shift ; }
[ $# -eq 0 ] && set .
ScanStuff "$regexp" "$@"
}
Main "$@"
#---eof
Obviously we can reimplement scripts in any language we want
without telling any of its other users, because it doesn't have some
[expletive deleted]
extension on the end, and so for everyone else it'll just keep working.
Why Are So Many Developers Recently Misusing Extensions?
This is... a theory.
In the late 1980s (based on my experience at the time), commandname
extensions were essentially absent from the Unix realm. Almost all
scripting was either in the Bourne shell, or in the Csh that a few screwballs
(myself included) tried to make work as a scripting language.
Ksh, Tcsh, and a few others were used at some sites.
Interpreter directives were required for all
of them except Bourne shell scripts, since sh would attempt to
execute an executable script via the kernel, but if that failed it would
just assume it was an sh script (they ALL were a decade
before, so it made some sense) and spawn a shell to interpret it,
which worked badly when the script was actually written in any of the
other languages.
In the 1990s, commandname extensions showed up occasionally
when DOS/Windows users started poking at Linux and dragging
along the DOS extension concept with them. However, DOS
hides filename extensions - you can run a DOS script even if
the extension is omitted when invoking it - so in theory they were
hiding metadata (and, coincidentally, creating an inroad for Trojan
attacks) instead of exposing the implementation language. In contrast,
Unix requires the entire name of the file to run commands -
including any extensions (or a string of them) since they're just more
characters - the . isn't special to the kernel, just part of
the name. Essentially the DOS practice is totally wrongheaded in the
Unix environment. Fortunately, during this period more experienced
Unix users tended to educate the DOS arrivals soon enough to keep the
practice from being all that common.
In the 2000s, and increasingly in 2010 and beyond, there was a sudden
explosion in commandname extensions, not from the DOS migrants but
rather from a new sub-population of programmers in languages like PHP,
Perl (to some extent), Python, Ruby and others - all languages which
were NOT compiled, and whose libraries tend to require
extensions, and whose users typically had little to no grounding in
Unix fundamentals, and hadn't worked in C (which produces executables
without extensions most of the time). These programmers improperly
overgeneralized the use of extensions from libraries to command
scripts, and then wrote lots of documentation that included this
aberrant practice. And now, suddenly they're everywhere, doing it
wrong while thinking it's right (that's what the docs say, after all),
and driving those who actually know how it works slightly insane.
So now we the insane ones are writing little webpages like this
to tell the interpreted-language crowd, please, please be
more sparing in your extensions. They don't belong on commands.
Really. Ever. Every time you mutilate a command by putting
an extension on it, some angry computing god out there kills a kitten.
Please - think of the kittens.
Thanks to:
(Note: Don't copy/paste the addresses, just type what they suggest.)
-
Jakub Wilk ‹jwilkⓐjwilk◦net›
for relevant yacc info, and catching various typos.