Created: 2016-01-23
Some complementary material on interpreter directives and command name issues
can be found in Wikipedia. You can also read the email in which
Dennis Ritchie introduced #!.
ABSTRACT
The command name for any Unix script must be stable for
any complex system based on it to be stable.
However, this is being
compromised through practices based on misinformation.
This paper explores how scripts are actually run,
how naming affects correctness and stability,
and various common misconceptions
in order to clarify the reasons behind standard practice
- which is: Command names should never have filename extensions.
Command name extensions have numerous issues:
- They unnecessarily expose implementation detail
(breaking encapsulation).
- They uselessly and incompletely mimic detail from the #! line.
- They capture insufficient detail to be useful at the system level
(and aren't used).
- They clash with recommended Unix (and Linux) practice.
- They add noise to the command-level API.
- They are very commonly technically incorrect for the script.
- They give incorrect impressions about the use of the files they adorn.
- They aren't validated even for what little info is present in them.
- They interfere with switching scripting languages.
- They interfere with changing scripting language versions.
- They interfere with changing to presumably-faster compiled forms.
- They encourage naïvely running scripts with the extension-implied
interpreter.
- They infect novice scripters with misinformation about Unix scripting.
Ironically all of these are problems involving interpretation by humans.
Herein, a problem with filename extensions is described in a manner
perhaps more pragmatic than, yet inspired by, the well known
Go To Statement Considered Harmful
by Edsger W. Dijkstra
(Communications of the ACM, Vol. 11, No. 3, March 1968).
Dijkstra's work addresses the issue of how the use of the go
to statement largely abridges the ability to parametrically
describe the progress of a process, engendering an unnecessary
impediment to the code's clarity and manageability.
This new document details, based on practical experience under Unix-like
operating systems, how filename extensions, particularly but not
limited to those on files implementing commands, create a secondary set of
semantic tags in the interfaces between programs which are
demonstrably both superfluous and treacherous.
It's not a coincidence that in both Dijkstra's plaint and this one,
computers are not at all affected by either practice - it's entirely a
problem for just the humans.
What is a Command
For purposes of this paper, command names are the filenames of all the
executable files in the directories listed in the Unix $PATH environment variable.
By convention, almost all such directories end in bin
(nominally suggestive of a tool bin, and not restricted to binaries),
with sbin (system bin) and games also being common.
Historically etc or lib occasionally appeared in $PATH, but this has become increasingly rare since the 1980s.
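As a quick illustration (a hypothetical but typical session - directories and
results will vary by system), the command names visible to a user are just the
executable filenames found along $PATH:
$ echo "$PATH"
/usr/local/bin:/usr/bin:/bin:/usr/games
$ command -v lesspipe      # which executable would actually run
/bin/lesspipe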
Didactic Examples
Consider the following examples, in which files have
.sh and .py
extensions, ostensibly to indicate the type of each file as well as to
make it easy to list all files of the same type (e.g. all shell scripts).
Running them based on the apparently-correct interpreter doesn't go well
(the […] means truncated for brevity):
$ sh frob.sh
frob.sh: 2: Syntax error: […]
$ ./frob.sh
hello world
$ cat frob.sh
#!/usr/bin/perl -w
# Used to be a shell script,
# but we couldn't change the name...
use strict;
printf("hello world\n");
$
$ python knob.py
File "knob.py", line 2
func = print
^
SyntaxError: invalid syntax
$ ./knob.py
hello world
$ cat knob.py
#!/usr/bin/python3
func = print
func("hello world")
$
$ sh qux.sh
qux.sh: 3: qux.sh: Syntax error: […]
$ ./qux.sh
hello world
$ cat qux.sh
#!/bin/bash
cat <<<'hello world'
$
These scripts show some problems around trusting extensions:
- The first example (frob.sh) is typical when the code has been later
reimplemented (here from sh to perl) but the
filename was left unaltered for backwards compatibility.
- The second example (knob.py), which can occur either the same way as the
one just mentioned or through a number of other reasons (python3 was
used where python2 was still the system default, etc.), shows
how a commandname extension
doesn't capture enough information to be useful, but instead breeds
false confidence.
- The third example (qux.sh) shows that extensions mislead even among shell
scripts: Bourne shell syntax is only a subset of Bourne-Again syntax,
yet people often use .sh where only .bash
would have made any sense, and even then only for .-ables
(function libraries and support scripts, not programs).
Failures aren't always as obvious as immediately exiting with
an error: more subtle differences in script language behavior, or a
script with enough error trapping to survive being run with the
wrong interpreter version, can produce incorrect results and
serious damage.
$ python divide.py 5 2
2
$ ./divide.py 5 2
2.5
$ cat divide.py
#!/usr/bin/python3
import sys
a, b = int(sys.argv[1]), int(sys.argv[2])
print(a / b)
$
(The same issue can arise through command search in $PATH
finding a different version of a program than expected,
especially when using virtual environments,
but that's outside the scope of this document.)
Methods of Specifying Interpreters for Scripts
Several mechanisms exist to determine how a file should be executed,
whether as a set of directives or as machine code. The ones relating
to this discussion are:
-
A process can let the kernel (in the exec call) use the
magic number
found at the start of a file, usually followed by
additional data, to determine details of the file's executable format.
Magic numbers are usually found at the beginning of files, encoding
whether a file contains a compiled program,
dynamically-loadable code,
or other data in a known binary format, such as
images, audio data, and so forth.
Typical examples are compiled programs,
which generally lack the extensions that were attached to the source
files from which they were compiled.
-
A process can elect to treat a file specially based upon a
file extension, which provides a small
subset of the magic number's functions, but allows the data within the
file to lack any kind of recognizable in-band header. In most
common Unix filesystems a filename isn't exactly file metadata, but
instead part of each directory entry linking to the file, and can easily
be set to disagree with the file's internal data. Unix also leaves
extension handling as a userspace function which varies widely between
different applications, has no direct kernel support, and is most
common for file types like audio, images, video, compiled programming
language files, etc.
-
A process can apply an arbitrary
filetype association
by running a file through some other program, as when a
shell user runs some interpreter program by name with an explicitly
specified data source, such as using sh explicitly to read
in and run a shell script. With most interpreters, this will cause
both filename extensions and magic numbers to be ignored. Automatic
dispatch by extension is NOT something Unix itself does, but many computer
users from other OSs (notably Windows) default to acting as though it is.
-
A process can assume an executable file with no magic number is
in the process's own language by default.
The Bourne shell, by virtue of having been the dominant early
Unix scripting language, is usually assumed as the default.
-
One particular magic number, 0x2321 (in big-endian), which
renders as #! in ASCII and is interpreted as a comment
in most scripting languages, tells the kernel to execute a specific,
given program with the file as an argument.
The first ~16 to ~80 bytes after the #! up to a newline
(the length limit is different
on different flavors/eras of Unix) specify the program and any
arguments.
This
interpreter directive
(one of several names with no clear winner)
is combined with the pathname of the file (or some other way to open it)
and the named interpreter is started as a new process.
The result is nearly identical to a user having simply entered the
interpreter and filename on the command line,
except that the user is freed from needing to know the program
and options required,
and thereby also freed of needing to know of the interpreter changes.
Some examples of interpreter directives include:
- #!/bin/sh
- #!/bin/bash -ex
- #!/usr/bin/perl -w
As well as less common applications:
- #!/usr/bin/less (on a README file, mode 755)
- #!/usr/bin/env perl (to apply PATH to finding perl)
- #!/usr/bin/gnuplot (followed by plotting commands)
- #!/bin/echo file: (try this to see the script's filename passed through)
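Both the generic magic-number mechanism and the #! case can be inspected with
ordinary tools. A brief sketch, reusing the frob.sh file from the earlier
examples and assuming GNU od (output abridged and approximate):
$ od -An -c -N 4 /bin/ls      # a compiled program: the ELF magic number
 177   E   L   F
$ od -An -c -N 2 ./frob.sh    # a script: 0x23 0x21, which renders as "#!"
   #   !
$ head -1 ./frob.sh           # the rest of the interpreter directive
#!/usr/bin/perl -w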
Interpreter Directives are an Intrinsic Part of File Content
Interpreter directives can only be changed by modifying the files'
contents, whereas file extensions can be changed arbitrarily using
general filesystem commands like mv. File extensions also
have a disturbing tendency to get lost in some contexts, since they're
part of a Unix directory entry, not part of the file itself.
In contrast, interpreter directives are quite stable. With scripts,
interpreter directives are typically changed in the same manner as the
other contents through using a text editor. Modern editors can usually
recognize scripts by their interpreter directive, although historically
special handling of certain types of text files was usually done based
on the file extension.
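To make the contrast concrete (reusing the earlier frob.sh), a rename silently
changes what the extension claims, while the directive inside the file stays
accurate until someone deliberately edits it:
$ mv frob.sh frob.py      # the name's claimed "type" changes with one directory operation
$ head -1 frob.py         # ...but the content, and thus the real interpreter, does not
#!/usr/bin/perl -w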
Humans are the Problem
Now, so far, command name extensions might look like no more than hints to
editors to use the correct editing mode, or to humans to make it easy to
ls by script type. The kernel doesn't view them specially at
all - they're just more bytes in the filename. But there
is an insidious problem with them, in that using them breaks part of
the mechanism by which the implementation details are hidden from the
user, and from other programs written by users. It's the humans'
attempt to apply the information in these command name extensions
that causes problems.
Effects of Porting Programs Between Languages
Programs in Unix often start their lives as quickly written,
inefficient, under-featured shell scripts. Later, they get converted to
something faster, like Perl or Python. Finally, they are often
rewritten in C, C++, or something else fully compiled. If the author
violates encapsulation by exposing the underlying language in a
spurious extension, the command name may change from
a name.sh, to
name.pl,
to name, breaking all existing coded
calls to the program each time, as well as adding to the cognitive
load of human users. The more effective the user base has been at
script-based factoring and reuse, the more treacherous the extensions
become (i.e. proficient users often build more readily on preëxisting
programs, increasing the number of dependencies on the names of those
programs).
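A tiny sketch of that breakage (the names summarize and nightly-report are
invented for illustration): any caller that hard-codes the extension breaks at
each rewrite, while callers of the extensionless name never notice:
#!/bin/sh
# nightly-report (hypothetical caller)
summarize.sh /var/log/app.log   # breaks the day summarize.sh is rewritten as summarize.pl
summarize /var/log/app.log      # survives sh -> perl -> C rewrites unchanged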
To combat the problem of breaking dependencies, what usually happens is
that when the name.sh script ends up being
rewritten in (for example) Perl, the now-misleading old name is
retained to keep from breaking other programs which refer to it. The
resulting mismatch causes extra maintenance hassles principally to
users trying to maintain the extensions, who naïvely type things like
ls -l *.sh without realizing some of the listed files
aren't shell scripts anymore. Such semantic dissonance leads
easily to more serious issues, with scripts called by the wrong
interpreters in error-suppressed contexts, truncated processing due to
the resulting errors, and the resulting arbitrarily disastrous problems.
Command Name Extensions Are Often Wrong - and Subtly
The issue of using the wrong interpreter can be subtle, since a user
seeing a name.py program may enter
python name.py, not realizing that the program
only works with python 2.5 when 2.4 is still the system default
(the former would have a directive like #!/usr/bin/python2.5).
Most scripts suffixed with .sh on Linux are actually bash
scripts, and many versions of Unix don't include bash, just
the real Bourne /bin/sh (no arrays, no $(…), no <(…), etc).
Scripts also often make delicate use of interpreter directives to
have the PATH used or ignored, or special options passed in,
none of which is capturable in a primitive filename extension.
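For instance, none of the following distinctions (illustrative directives only)
survive being flattened into a .py or .pl suffix:
#!/usr/bin/python2.5        # an exact interpreter version is required
#!/usr/bin/env python3      # python3, but located via the caller's $PATH
#!/usr/bin/perl -T -w       # taint checking and warnings must be enabled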
Some Command Name Extensions Matter - But Not To the Kernel
There are cases where scripts are executed as a result of special
extensions, such as the model currently used by most webservers where
file handling is cued by filename extensions. However, even such
subsystems often have other, more sophisticated approaches allowing
those same extensions to be hidden, and thus protect URIs from a
variant of the script filename extension problem, namely, how to keep
all links to your website from breaking when you switch
from *.html files to *.cgi, *.php, or
something else. Furthermore, of the extensions just listed, note that
.html files aren't scripts, .php files use a
webserver builtin, and .cgi scripts themselves require
interpreter directives to be executed correctly, in addition to
the .cgi needed for Apache to permit the script to be run.
Commands should never have filename extensions.
Rely on interpreter directives instead or some other paradigm that
prevents the implementation from being exposed, or worse yet, lied
about, within the very name of the command. The best place for
interpreter information is in the file itself, though as noted there are
some issues to deal with via #!/usr/bin/env and other tactics.
Appendices
Python
So you have this file named foo.py...
If foo is a library with a unittest activated by being run
with python foo.py or even as just ./foo.py, that's
okay - that's not a command that would live in $PATH.
However, if it's a full program with no library aspirations,
the .py is clearly wrong. I've almost never seen
a foo.py in the $PATH, since such hacks usually end
up littering top-level Flask directories or being placed on the
$PYTHONPATH (for Python libraries) instead.
There's a case where, in some bin/ directory, there are both
a foo.py implementing a library,
and a foo implementing the options parsing and using library.
In this situation the foo is executable and
the foo.py isn't, and because the extension stays on the
non-command library file, this situation is fine
(though rare).
As an example, here's a
library hellolib.py and a program hi just as described
above (save for the names):
# hellolib.py library
import unittest

def hi(whom=None):
    return 'Hello' + ((' ' + whom) if whom else ' World')

class TestLib(unittest.TestCase):
    def test_hi(self):
        self.assertEqual('Hello World', hi())
        self.assertEqual('Hello You', hi('You'))
#!/usr/bin/python
# hi program: "chmod 755 hi" so you can run ./hi
import hellolib, sys

someone = (sys.argv[1] if len(sys.argv) > 1 else None)
print(hellolib.hi(someone))
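A quick illustrative run (output approximate, with nosetests as the test
runner used throughout this appendix):
$ chmod 755 hi
$ ./hi
Hello World
$ ./hi You
Hello You
$ nosetests hellolib
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK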
There's no point in being able to
run ./hellolib.py or
python hellolib.py, because we're
obviously just going to run nosetests hellolib anyway, as per
standard practice. Otherwise, we'd have to add the rather ugly,
though accepted, lines below:
if __name__ == '__main__':
    unittest.main()
...which is a bit nasty, since we'd have to either add execute permission
on the library file too, as well as a #! line, or
guess at which version of Python is needed to run it manually,
e.g. python hellolib.py. Also, enabling execute permission makes
nosetests's decision of whether it's safe to import the file (without
causing side effects) much harder, so it doesn't test executable
files by default, and we risk the unittest in our library being skipped.
Listing All Files of the Same Script Type (sh, py, etc)
The issue of users wanting to be able to list, for example, all Bourne
shell scripts easily with ls(1) is a big motivator for some people to
name them all with .sh extensions. If ls had an option to
filter based on the execution method of a file, say something
like ls -e '*/sh' to list only files with /sh at
the end of the first part of the interpreter directive, that would
help. However, whether ls should even be doing such a job would
probably be hotly, justifiably contested.
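In the meantime, a rough one-liner (a sketch assuming GNU find and grep) gets
most of the way there, though the dedicated wrapper shown below is friendlier:
$ find . -maxdepth 1 -type f -perm -u+x \
    -exec sh -c 'head -1 "$1" | grep -q "^#!.*/sh" && printf "%s\n" "$1"' _ {} \;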
Here's an example of using a new program to address this problem:
$ cd /bin
$ scripts /bin/sh | wc -l
10
$ scripts /bin/sh
./bzgrep
./bzmore
./running-in-container
./setupcon
./unicode_start
./lesspipe
./red
./bzdiff
./bzexe
./which
$ less $(scripts /bin/sh)
$ head -1 $(scripts)
==> ./zforce <==
#!/bin/bash
==> ./bzgrep <==
#!/bin/sh
==> ./bzmore <==
#!/bin/sh
==> ./gunzip <==
#!/bin/bash
[…]
$ cd /usr/bin
$ head -1 $(scripts) | grep '#!' | sort | uniq -c | sort -nr
191 #!/bin/sh
104 #!/usr/bin/perl -w
102 #!/usr/bin/perl
58 #! /usr/bin/python
53 #! /bin/sh
7 #!/usr/bin/ruby1.9.1
3 #!/usr/bin/fontforge -lang=ff
2 #!/usr/bin/pypy
1 #!/usr/bin/env nickle
A sample script implementing the command follows (obviously with no
extension, in case someone wants to rewrite it in Python, Ruby, C, etc.).
Note that it needs #!/bin/bash specifically, since a classic Bourne
/bin/sh doesn't support $(...) or local.
#!/bin/bash
# Return a list of scripts having a given string in the interpreter directive.
Syntax () {
local regexp="$1"
echo "Syntax: $0 [<regexp> [<file>|<dir>]...]"
echo " $0 {-h | --help}"
echo ' <regexp> - used to match interpreter directives'
echo ' <file> - report file if <regexp> matches'
echo ' <dir> - report each file in <dir> for which <regexp> matches'
echo 'If no <file> or <dir> is given, "." is used as a default.'
echo 'Give a <regexp> of "." to use the default ('"$regexp"') with <file>/<dir>.'
echo 'NOTE: Only executable files are considered.'
}
ScanFile () { [ -x "$1" ] && head -1 "$1" | egrep -qs -- "$2" ; }
ScanStuff () {
local found=false
local regexp="$1" ; shift
local thing dir file
for thing in "$@" ; do
if [ -d "$thing" ] ; then
dir="$thing"
for file in $(find "$dir" -name . -o -type d -prune -o -type f -print) ; do
ScanFile "$file" "$regexp" && echo "$file" && found=true
done
else
file="$thing"
ScanFile "$file" "$regexp" && echo "$file" && found=true
fi
done
$found
}
Main () {
local regexp='^#!'
case "$1" in
--) shift ;;
-h|--help) Syntax "$regexp" ; exit 0 ;;
-*) Syntax "$regexp" 1>&2 ; exit 1 ;;
esac
[ $# -ge 1 ] && { regexp="$1" ; shift ; }
[ $# -eq 0 ] && set .
ScanStuff "$regexp" "$@"
}
Main "$@"
#---eof
Obviously we can reimplement scripts in any language we want
without telling any of its other users, because it doesn't have some
[expletive deleted]
extension on the end, and so for everyone else it'll just keep working.
Why Are So Many Developers Recently Misusing Extensions?
This is... a theory.
In the late 1980s (based on my experience at the time), commandname
extensions were essentially absent from the Unix realm. Almost all
scripting was either in the Bourne shell, or in the Csh that a few screwballs
(myself included) tried to make work as a scripting language.
Ksh, Tcsh, and a few others were used at some sites.
Interpreter directives were required for all
of them except Bourne shell scripts, since sh would attempt to
execute an executable script via the kernel, but if that failed it would
just assume it was an sh script (they ALL were a decade
before, so it made some sense) and spawn a shell to interpret it,
which worked badly when the script was actually written in any of the
other languages.
In the 1990s, commandname extensions showed up occasionally
when DOS/Windows users started poking at Linux and dragging
along the DOS extension concept with them. However, DOS
hides filename extensions - you can run a DOS script even if
the extension is omitted when invoking it - so in theory they were
hiding metadata (and, coincidentally, creating an inroad for Trojan
attacks) instead of exposing the implementation language. In contrast,
Unix requires the entire name of the file to run commands -
including any extensions (or a string of them) since they're just more
characters - the . isn't special to the kernel, just part of
the name. Essentially the DOS practice is totally wrongheaded in the
Unix environment. Fortunately, during this period more experienced
Unix users tended to educate the DOS arrivals soon enough to keep the
practice from being all that common.
In the 2000s, and increasingly in 2010 and beyond, there was a sudden
explosion in commandname extensions, not from the DOS migrants but
rather from a new sub-population of programmers in languages like PHP,
Perl (to some extent), Python, Ruby and others - all languages which
were NOT compiled, and whose libraries tend to require
extensions, and whose users typically had little to no grounding in
Unix fundamentals, and hadn't worked in C (which produces executables
without extensions most of the time). These programmers improperly
overgeneralized the use of extensions from libraries to command
scripts, and then wrote lots of documentation that included this
aberrant practice. And now, suddenly they're everywhere, doing it
wrong while thinking it's right (that's what the docs say, after all),
and driving those who actually know how it works slightly insane.
So now we the insane ones are writing little webpages like this
to tell the interpreted-language crowd, please, please be
more sparing in your extensions. They don't belong on commands.
Really. Ever. Every time you mutilate a command by putting
an extension on it, some angry computing god out there kills a kitten.
Please - think of the kittens.
Thanks to:
(Note: Don't copy/paste the addresses, just type what they suggest.)
-
Jakub Wilk ‹jwilkⓐjwilk◦net›
for relevant yacc info, and catching various typos.