Some complementary material on
can be found in Wikipedia
Herein, a problem with filename extensions is described in a manner
perhaps more pragmatic than, yet inspired by, the well-known
"Go To Statement Considered Harmful" by Edsger W. Dijkstra (Communications of the ACM, Vol. 11, No. 3, March 1968).
Dijkstra's work addresses the issue of how the use of the go
to statement largely abridges the ability to parametrically
describe the progress of a process, engendering an unnecessary
impediment to the code's clarity and manageability.
This new document details, based on practical experience under Unix-like
operating systems, how filename extensions, particularly but not
limited to those files implementing commands, create a secondary set of
semantic tags in the interfaces between programs which are
demonstrably both superfluous and treacherous.
Consider the following example, in which a file is named with a .sh
extension, to indicate the type of the file as well as to make it easy
to list all files of the same type (shell scripts).
$ ls *sh
$ sh frob.sh
frob.sh: line 2: use: command not found
$ cat frob.sh
printf "hello world\n";
Such a file is typical of scripts written by relatively inexperienced
users of Unix, where the code has been later reimplemented but the
filename left unaltered for backwards compatibility. The surprises in
the second two commands should be self-evident, and are the focus of
Three common mechanisms exist within Unix to determine how a file should
be processed as a set of directives:
- A process can elect to treat a file specially based upon a
file extension, which provides a small subset of the magic
number functions, but allows for data within the file to omit having
any kind of recognizable in-band header. A filename is file
meta-data, and can be set to disagree with the file's actual internal
- For compiled binaries, the kernel uses the start of the file, consisting of a
magic number and attendant data, to determine whether the file
can be executed. Processes can also use magic numbers to determine
whether a file might contain dynamically loadable code, as in dynamic
libraries, or might contain data in a known binary format, such as
images, audio data, and so forth.
- A process can elect to arbitrarily associate a file with the run of
some program, as when a shell user runs some interpreter program by
name with an explicitly specified data source, such as
using sh explicitly to read in and run a shell script. With
most interpreters, this will cause both filename extensions and magic
numbers to be ignored.
- As a composite of the latter two of these, the special
“#!” magic number (where the initial two bytes are
convenient ASCII and human-readable, as well as appearing to be a
comment to many interpreters) can be used to specify the appropriate
interpreter program to process the file's contents.
This interpreter directive contains the pathname and
possibly arguments to the interpreter program, which will be combined
with the pathname of the file and created as a new process, which is
expected to open and process the file. The result is nearly
identical to a user having simply entered the interpreter and
filename on the command line, except that the user is freed of
needing to know the interpreter, and thereby also freed of needing to
know when the interpreter changes. Some examples of interpreter
As well as less common applications:
- #!/usr/bin/perl -w
- #!/usr/bin/less (on a README file, mode 755)
- #!/usr/bin/env perl (to apply PATH to finding perl)
- #!/usr/bin/gnuplot (followed by plotting commands)
- #!/bin/echo file: (try this to see the script's filename passed through)
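The way a directive is combined with the file's pathname can be sketched in shell. The function below, directive_cmdline, is a hypothetical helper written for this illustration, not a standard utility. It only prints the command line implied by a file's first line and does not execute anything. How the kernel splits the arguments after the interpreter path differs between systems and is not modeled here.

```shell
# directive_cmdline FILE
# Print "<directive contents> FILE", i.e. roughly the command line that
# results when FILE is executed, or return 1 if FILE has no "#!" line.
# Hypothetical helper; POSIX sh has no local variables, so it uses globals.
directive_cmdline() {
    first=
    # Read only the first line; tolerate a missing trailing newline.
    IFS= read -r first < "$1" || [ -n "$first" ] || return 1
    case $first in
        '#!'*) ;;
        *) return 1 ;;
    esac
    # Drop the two magic bytes and any blanks that follow them.
    rest=$(printf '%s\n' "$first" | sed 's/^#![[:space:]]*//')
    [ -n "$rest" ] || return 1
    printf '%s %s\n' "$rest" "$1"
}

# Demonstration on a throwaway file.
demo=$(mktemp)
printf '#!/bin/echo file:\n' > "$demo"
directive_cmdline "$demo"    # prints "/bin/echo file: " followed by the temp file's path
rm -f "$demo"
```

Note that the file's name plays no part in choosing the interpreter; only the first line of its contents does.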
Interpreter directives can only be changed by modifying the files'
contents, whereas file extensions can be changed arbitrarily using
general filesystem commands like mv. File extensions also
have a disturbing tendency to get lost in some contexts, in contrast
with interpreter directives which are quite stable. With scripts,
interpreter directives are typically changed in the same manner as the
other contents, by using an editor of some kind, usually vi
or emacs. Modern editors can usually recognize scripts by
their interpreter directive, although historically special handling of
certain types of text files was usually done based on the file
extension. It's noteworthy that the file extension model applied
almost exclusively to non-script, non-executable text files lacking
interpreter directives (C files ending in .c and .h),
which use extensions to trigger special handling by an application such
as the C compiler rather than from the kernel.
Now, so far, extensions might look like no more than triggers for
special handling for editing sessions, or as human-readable metadata
allowing easy categorization without the effort of viewing the files'
contents. But there is a more insidious problem with them, in that
using them breaks part of the mechanism by which the implementation
details are hidden from the user, from the kernel, and from other
programs which might call the script.
Programs in Unix often start their lives as quickly written,
inefficient, underfeatured shell scripts. Later, they get converted to
something faster, like Perl or Python. Finally, they are often
rewritten in C, C++, or something else fully compiled. If the author
violates encapsulation by exposing the underlying language in a
spurious extension, the command name may change from
name.sh, to
name.pl, to name, breaking all existing coded
calls to the program each time, as well as adding to the cognitive
load of human users. The more effective the user base has been at
script-based factoring and reuse, the more treacherous the extensions
become (i.e., proficient users often build more readily on preëxisting
programs, increasing the number of dependencies on the names of those
programs).
In fact, what usually happens is that the name.sh
script ends up being rewritten in something like Perl, yet with the
now-misleading old name retained to keep from breaking other programs
which refer to it. The resulting mismatch causes extra maintenance
hassles, principally for users trying to maintain the extensions, who
naïvely type things like
ls -l *.sh without realizing some of the listed files
aren't shell scripts anymore. Such semantic dissonance leads
easily to more serious issues, with scripts called by the wrong
interpreters in error-suppressed contexts, truncated processing due to
the resulting errors, and the resulting arbitrarily disastrous problems.
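One way to spot such mismatches is to compare each file's extension against its interpreter directive. The following is a minimal sketch, assuming that any interpreter whose name ends in "sh" counts as a shell (a rough heuristic) and that a directive using env names the real interpreter as its first argument. The function name sh_ext_mismatches is hypothetical.

```shell
# sh_ext_mismatches DIR
# Print "FILE: INTERPRETER" for each DIR/*.sh file whose "#!" line names
# an interpreter that does not look like a shell. Files without a
# directive are skipped, since nothing in-band says what they are.
sh_ext_mismatches() {
    for f in "$1"/*.sh; do
        [ -f "$f" ] || continue
        first=
        IFS= read -r first < "$f" || [ -n "$first" ] || continue
        case $first in
            '#!'*) ;;
            *) continue ;;
        esac
        # Split the directive into the interpreter path and its first argument.
        read -r interp arg rest <<EOF
${first#??}
EOF
        case ${interp##*/} in
            env) name=${arg##*/} ;;      # "#!/usr/bin/env perl" means perl
            *)   name=${interp##*/} ;;
        esac
        case $name in
            *sh) ;;                      # sh, bash, dash, ksh, zsh, ...
            *)   printf '%s: %s\n' "$f" "$name" ;;
        esac
    done
}
```

Running sh_ext_mismatches on a directory of commands lists the .sh files that are no longer shell scripts, which is exactly the set that ls -l *.sh misreports.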
The wish to list, for example, all Bourne shell scripts easily with
ls(1) is a big motivator for users to name them
all with .sh extensions.
If ls(1) had an option to filter based on the execution method of a
file, say something like
ls -e '*/sh'
to list only
files with /sh at the end of the first part of the
interpreter directive, that would help. However, whether ls(1) should
even be doing such a job would probably be hotly, justifiably debated.
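Such a filter does not have to live in ls(1) at all. As a rough sketch, a small shell function can select files by the first part of their interpreter directive. The name ls_by_interp and its argument order are assumptions made for this example, and no such option or command is standard.

```shell
# ls_by_interp PATTERN FILE...
# Print each FILE whose "#!" line names an interpreter path matching the
# shell glob PATTERN. Only the first word of the directive is compared.
ls_by_interp() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue
        first=
        IFS= read -r first < "$f" || [ -n "$first" ] || continue
        case $first in
            '#!'*) ;;
            *) continue ;;
        esac
        # Keep only the interpreter path; leading blanks are dropped by read.
        read -r interp rest <<EOF
${first#??}
EOF
        # $pattern is left unquoted on purpose so it acts as a glob.
        case $interp in
            $pattern) printf '%s\n' "$f" ;;
        esac
    done
}
```

For example, ls_by_interp '*/sh' * would list the files in the current directory whose directive names an interpreter ending in /sh, regardless of what the files are called.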
The issue of using the wrong interpreter can be subtle, since a user
seeing a name.py program may enter
python name.py, not realizing that the program
only works with Python 2.5 when 2.4 is still the system default
(the former would have a directive like #!/usr/bin/python2.5).
Scripts also often make delicate use of interpreter directives to
have the PATH used or ignored, or special options passed in.
There are cases where scripts are executed as a result of special
extensions, such as the model currently used by most webservers where
file handling is cued by filename extensions. However, even such
subsystems often have other, more sophisticated approaches allowing
those same extensions to be hidden, and thus protect URIs from a
variant of the script filename extension problem, namely, how to keep
all links to your website from breaking when you switch
from *.html files to *.cgi, *.php, or
something else. Furthermore, of the extensions just listed, note that
.html files aren't scripts, .php files use a
webserver builtin, and that .cgi scripts themselves require
interpreter directives as well as the .cgi.
Commands should never have filename extensions.
Rely instead on interpreter directives, or on some other paradigm that
prevents the implementation from being exposed, or worse yet, lied
about, within the very name of the command.