Herein, a problem with filename extensions is described in a manner
much more pragmatic than, yet inspired by, the well-known
Go To Statement Considered Harmful by Edsger W. Dijkstra (Communications of the ACM, Vol. 11, No. 3, March 1968).
That work addresses the issue of how the use of the go
to statement largely abridges the ability to parametrically
describe the progress of a process, engendering an unnecessary
impediment to the code's clarity and manageability.
This work details, based on practical experience under Unix-like
operating systems, how filename extensions, particularly but not
limited to those on files implementing commands, create a secondary set of
semantic tags in the interfaces between programs which are
demonstrably both superfluous and treacherous.
$ ./frob.bash
hello world
$ ls *sh
frob.bash
$ sh frob.bash
frob.bash: line 2: use: command not found
hello world
$ cat frob.bash
#!/usr/bin/perl -w
use strict;
printf "hello world\n";
$
Three common mechanisms exist within Unix to determine how a file should
be processed as a set of directives:
- The kernel uses the start of the file, consisting of a
magic number and attendant data, to determine whether the file
can be executed. Processes can also use magic numbers to determine
whether a file might have suitable content for execution. Scripts
often begin with a two-byte magic number in which each byte
conveniently falls within the printable ASCII character set, that
is, “#!”.
- A process can elect to treat a file specially based upon a
file extension, which provides a small subset of the magic
number functions, but allows for data within the file to omit having
any kind of recognizable in-band header. A filename is file
meta-data, and can be set to disagree with the file's actual internal
data.
- A process can elect to arbitrarily associate a file with the run of
some program, as when a shell user runs some interpreter program by
name with an explicitly specified data source. Depending on the
interpreter, both filename extensions and magic numbers might be
ignored.
- As a composite of two of these, the “#!” magic number
can be used to specify the appropriate interpreter program to process
the file's contents. This interpreter directive contains
the pathname of, and possibly arguments to, the interpreter program. The
kernel combines these with the pathname of the file to launch a new
process, which is expected to open and process the file. The result
is nearly identical to a user having simply entered the interpreter
and filename on the command line, except that the user is freed of
needing to know the interpreter, and thereby also freed of needing
to know when the interpreter changes. Some examples of interpreter
directives include:
- #!/bin/sh
- #!/bin/bash
- #!/usr/bin/perl
- #!/usr/bin/perl -w
As well as less common applications:
- #!/usr/bin/less (on a README file, mode 755)
- #!/usr/bin/env perl (to apply PATH to finding perl)
- #!/usr/bin/gnuplot (followed by plotting commands)
- #!/bin/echo file: (try this to see the script's filename passed through)
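The “#!” check described in the list above can be sketched in a few lines of portable shell. This is only an illustration of looking at the start of a file rather than at its name, not a description of how any kernel implements it; the function name interpreter_directive is hypothetical, and the sketch assumes the file is readable.

```shell
#!/bin/sh
# Print a file's interpreter directive (the text after "#!"), if it has one.
# interpreter_directive is a hypothetical helper name, not a standard tool.
interpreter_directive() {
    first_line=
    # read returns non-zero when the first line has no trailing newline,
    # but it still fills the variable, so accept that case too.
    IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
    case $first_line in
        '#!'*) printf '%s\n' "${first_line#??}" ;;  # strip the two magic bytes
        *)     return 1 ;;                          # no directive present
    esac
}
```

Applied to the frob.bash file from the opening example, interpreter_directive frob.bash would print /usr/bin/perl -w, whatever the filename suggests.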
Interpreter directives can only be changed by modifying the files'
contents, whereas file extensions can be changed arbitrarily using
general filesystem commands like mv. File extensions also
have a disturbing tendency to get lost in some contexts, in contrast
with interpreter directives which are quite stable. With scripts,
interpreter directives are typically changed in the same manner as the
other contents, by using an editor of some kind, usually vi
or emacs. Modern editors can usually recognize scripts by
their interpreter directive, although historically special handling of
certain types of text files was usually done based on the file
extension. It's noteworthy that the file extension model applied
almost exclusively to non-script, non-executable text files lacking
interpreter directives (such as C files ending in .c and .h), where the
extension triggers special handling by an application such as the C
compiler rather than by the kernel.
Now, so far, extensions might look like no more than triggers for special
handling for editing sessions, or as human-readable metadata
allowing easy categorization without the effort of viewing the files'
contents. But there is a more insidious problem with them, in that
using them breaks part of the mechanism (the other part is the PATH
variable) by which the implementation details are hidden from the user,
from the kernel, and from other programs which might call the script.
Programs in Unix often start their lives as quickly-written,
inefficient, underfeatured shell scripts. Later, they get converted to
something faster, like Perl or Python. Finally, they get rewritten
outright in C, C++, or something else fully compiled. If the author
violates encapsulation by exposing the underlying language in a
spurious extension, the command name may change from NAME.sh, to
NAME.pl, to NAME, breaking all existing coded calls to the program each
time, as well as adding to the cognitive load of human users. The
more effective the user base has been at script-based factoring and
reuse, the more treacherous the extensions become.
In fact, what usually happens is that the NAME.sh script ends up being
rewritten in something like Perl, yet with the now-misleading old name
retained to keep from breaking other programs which refer to it.
The resulting mismatch
causes extra maintenance hassles, principally for users trying to
maintain the extensions, who naïvely type things like
ls -l *.sh without realizing that some of the listed files
aren't shell scripts anymore. Such semantic dissonance leads
easily to more insidious issues, with scripts called by the wrong
interpreters in error-suppressed contexts, truncated processing after
the resulting errors, and the resulting arbitrarily disastrous problems.
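The kind of mismatch described above can be detected mechanically. The following is a rough sketch, assuming a small, purely illustrative table that maps a few extensions to a substring expected in the interpreter directive; the function name extension_mismatches is hypothetical, and the substring match is a heuristic rather than a precise test.

```shell
#!/bin/sh
# List files in a directory whose extension disagrees with their "#!" line.
# The extension-to-interpreter table below is an illustrative assumption.
extension_mismatches() {
    for f in "$1"/*.*; do
        [ -f "$f" ] || continue
        case ${f##*.} in
            sh)   want=sh ;;
            bash) want=bash ;;
            pl)   want=perl ;;
            py)   want=python ;;
            *)    continue ;;           # extension not in the table
        esac
        first_line=
        IFS= read -r first_line < "$f" || [ -n "$first_line" ] || continue
        case $first_line in
            '#!'*"$want"*) ;;           # extension and directive agree
            '#!'*) printf '%s: %s\n' "$f" "$first_line" ;;  # they disagree
        esac
    done
}
```

Run against a directory holding the frob.bash file from the opening example, this would report frob.bash together with its #!/usr/bin/perl -w line, which is exactly the file that ls *sh misclassifies.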
There are cases where scripts are executed as a result of special
extensions, such as the model currently used by most webservers
where file handling is cued by filename extensions. Note that even
such subsystems often have other, more sophisticated approaches
allowing those same extensions to be hidden, and thus protect URIs from
a variant of the script filename extension problem, namely, how to
keep all links to your website from breaking when you switch from *.html
files to *.cgi, *.php, or something else. And of the extensions just
listed, note that .html files aren't scripts, .php files use a webserver
builtin, and .cgi scripts themselves require interpreter directives
as well as the .cgi extension.
Commands should never have filename extensions.
Interpreter directives should always be used for scripts. Moreover, in
cases where a directive is given, users should be
reluctant to override it by running an interpreter on the script
explicitly on the command line. A directive may well be hand-tailored to
the script, for example by using a different version of perl than the
default system perl, which turns the explicit override into a
subtle chance for surprises, especially at sites with heterogeneous
hardware, OSes, etc., where a script may depend on a network-wide
version of the interpreter which differs from the ones on individual
systems.
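One way to act on the first recommendation above is to audit a command directory for executables that still carry a language extension. This is a minimal sketch under stated assumptions: the function name commands_with_extensions is hypothetical, and the list of extensions is an arbitrary sample, chosen narrowly because some legitimate command names contain dots.

```shell
#!/bin/sh
# Print the names of executable files in a directory that end in a
# language extension. The extension list is an illustrative sample.
commands_with_extensions() {
    for f in "$1"/*; do
        # Only regular files with execute permission count as commands.
        [ -f "$f" ] && [ -x "$f" ] || continue
        case ${f##*/} in
            *.sh|*.bash|*.pl|*.py|*.rb) printf '%s\n' "${f##*/}" ;;
        esac
    done
}
```

Calling it once for each directory named in PATH would give a list of candidates for renaming.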
So define the interpreter for your scripts using only the interpreter
directive - the others lie on the path to darkness :-)