Erlkönig: Commandname Extensions Considered Harmful

Herein, a problem with filename extensions is described in a manner much more pragmatic than, yet inspired by, the well-known Go To Statement Considered Harmful by Edsger W. Dijkstra (Communications of the ACM, Vol. 11, No. 3, March 1968). That work addresses the issue of how the use of the go to statement largely abridges the ability to parametrically describe the progress of a process, engendering an unnecessary impediment to the code's clarity and manageability. This work details, based on practical experience under Unix-like operating systems, how filename extensions, particularly but not limited to those of files implementing commands, create a secondary set of semantic tags in the interfaces between programs which are demonstrably both superfluous and treacherous.

$ ./frob.bash
hello world
$ ls *sh
frob.bash
$ sh frob.bash
frob.bash: line 2: use: command not found
hello world
$ cat frob.bash
#!/usr/bin/perl -w
use strict;
printf "hello world\n";
$  

Three common mechanisms exist within Unix to determine how a file should be processed as a set of directives:

  • The kernel uses the start of the file, consisting of a magic number and attendant data, to determine whether the file can be executed. Processes can also use magic numbers to determine whether a file might have suitable content for execution. Scripts often begin with a two-byte magic number in which each byte conveniently falls within the printable ASCII character set, that is, “#!”.
  • A process can elect to treat a file specially based upon a file extension, which provides a small subset of the magic number functions, but allows the data within the file to omit any kind of recognizable in-band header. A filename is file metadata, and can be set to disagree with the file's actual internal data.
  • A process can elect to arbitrarily associate a file with the run of some program, as when a shell user runs some interpreter program by name with an explicitly specified data source. Depending on the interpreter, both filename extensions and magic numbers might be ignored.
  • As a composite of two of these, the “#!” magic number can be used to specify the appropriate interpreter program to process the file's contents. This interpreter directive contains the pathname of, and possibly arguments to, the interpreter program; these are combined with the pathname of the file and started as a new process, which is expected to open and process the file. The result is nearly identical to a user having simply entered the interpreter and filename on the command line, except that the user is freed of needing to know the interpreter, and thereby also freed of needing to know when the interpreter changes. Some examples of interpreter directives include:
    • #!/bin/sh
    • #!/bin/bash
    • #!/usr/bin/perl
    • #!/usr/bin/perl -w
    As well as less common applications:
    • #!/usr/bin/less (on a README file, mode 755)
    • #!/usr/bin/env perl (to apply PATH to finding perl)
    • #!/usr/bin/gnuplot (followed by plotting commands)
    • #!/bin/echo file: (try this to see the script's filename passed through)
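
The way an interpreter directive is combined with the file's pathname can be sketched in shell. This is only an illustration under stated assumptions: directive_command is a hypothetical helper, not a standard tool, and it merely prints the resulting command line. The kernel's real handling of “#!” differs in details such as length limits and how arguments are split.

```shell
# Hypothetical helper: print the command line implied by a file's
# interpreter directive, or return non-zero if the file has none.
directive_command() {
    dc_line=
    # Only the first line matters; the second test tolerates a missing newline.
    IFS= read -r dc_line < "$1" || [ -n "$dc_line" ] || return 1
    case $dc_line in
        '#!'*) ;;                              # the two-byte magic number
        *)     return 1 ;;
    esac
    dc_line=${dc_line#??}                      # drop "#!"
    dc_line=${dc_line#"${dc_line%%[! ]*}"}     # drop spaces that follow it
    # Interpreter (plus any argument) first, then the script's own pathname.
    printf '%s %s\n' "$dc_line" "$1"
}

# Demonstration on a throwaway file (the name "frob" is arbitrary).
dc_dir=$(mktemp -d)
printf '#!/usr/bin/perl -w\nuse strict;\n' > "$dc_dir/frob"
directive_command "$dc_dir/frob"
```

The printed line is what a user would otherwise have had to type by hand: the interpreter, its argument, and then the script's pathname.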

    Interpreter directives can only be changed by modifying the files' contents, whereas file extensions can be changed arbitrarily using general filesystem commands like mv. File extensions also have a disturbing tendency to get lost in some contexts, in contrast with interpreter directives, which are quite stable. With scripts, interpreter directives are typically changed in the same manner as the other contents, by using an editor of some kind, usually vi or emacs. Modern editors can usually recognize scripts by their interpreter directive, although historically special handling of certain types of text files was usually done based on the file extension. It's noteworthy that the file extension model applied almost exclusively to non-script, non-executable text files lacking interpreter directives (such as C files ending in .c and .h), which use extensions to trigger special handling by an application such as the C compiler rather than by the kernel.

    Now, so far, extensions might look like no more than triggers for special handling for editing sessions, or as human-readable metadata allowing easy categorization without the effort of viewing the files' contents. But there is a more insidious problem with them, in that using them breaks part of the mechanism (the other part is the PATH variable) by which the implementation details are hidden from the user, from the kernel, and from other programs which might call the script.

    Programs in Unix often start their lives as quickly written, inefficient, underfeatured shell scripts. Later, they get converted to something faster, like Perl or Python. Finally, they get rewritten full out in C, C++, or something else fully compiled. If the author violates encapsulation by exposing the underlying language in a spurious extension, the command name may change from NAME.sh, to NAME.pl, to NAME, breaking all existing coded calls to the program each time, as well as adding to the cognitive load of human users. The more effective the user base has been at script-based factoring and reuse, the more treacherous the extensions become.
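
The encapsulation argument can be tried out directly. In this sketch, under stated assumptions, a hypothetical command named frob is installed without an extension in a scratch directory placed on PATH, first as a shell script and then rewritten in awk (standing in for any other language; awk is assumed to be installed). The caller, call_frob, is also hypothetical and is never edited.

```shell
frob_bin=$(mktemp -d)          # scratch directory standing in for a bin directory
PATH="$frob_bin:$PATH"

# Install the hypothetical command "frob" in the given language: sh or awk.
install_frob() {
    case $1 in
        sh)  printf '#!/bin/sh\necho "hello world"\n' ;;
        # The path to awk is looked up rather than assumed.
        awk) printf '#!%s -f\nBEGIN { print "hello world" }\n' "$(command -v awk)" ;;
    esac > "$frob_bin/frob"
    chmod +x "$frob_bin/frob"
}

# A caller that knows only the command name, not how it is implemented.
call_frob() { frob; }

install_frob sh     # first life: a shell script
call_frob

install_frob awk    # later life: rewritten in another language
call_frob
```

Had the command been installed as frob.sh and later as frob.awk, call_frob would have needed editing at the rewrite; with the bare name, only the interpreter directive inside the file changes.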

    In fact, what usually happens is that the NAME.sh script ends up being rewritten in something like Perl, yet with the now-misleading old name retained to keep from breaking other programs which refer to it. The resulting mismatch causes extra maintenance hassles, principally for users trying to maintain the extensions, who naïvely type things like ls -l *.sh without realizing that some of the listed files aren't shell scripts anymore. Such semantic dissonance leads easily to more insidious issues, with scripts called by the wrong interpreters in error-suppressed contexts, truncated processing over the resulting errors, and the resulting arbitrarily disastrous problems.
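
How far a listing such as ls -l *.sh can mislead is easy to check. The following is a minimal sketch, assuming a hypothetical helper named mislabeled_sh and an arbitrary choice of which interpreter names count as shells; it reports every *.sh file in a directory whose interpreter directive names something else.

```shell
# Hypothetical helper: list the *.sh files in directory $1 whose interpreter
# directive does not name a shell. The set of shell names is an assumption.
mislabeled_sh() {
    for ms_file in "$1"/*.sh; do
        [ -f "$ms_file" ] || continue            # also skips an unmatched glob
        ms_line=
        IFS= read -r ms_line < "$ms_file" || [ -n "$ms_line" ] || continue
        case $ms_line in
            '#!'*) ;;
            *)     continue ;;                   # no directive to compare against
        esac
        ms_rest=${ms_line#??}                    # drop "#!"
        ms_rest=${ms_rest#"${ms_rest%%[! ]*}"}   # drop spaces that follow it
        ms_interp=${ms_rest%% *}                 # first word: the interpreter path
        ms_interp=${ms_interp##*/}               # keep only its basename
        if [ "$ms_interp" = env ]; then          # "#!/usr/bin/env NAME" form
            ms_rest=${ms_rest#* }
            ms_interp=${ms_rest%% *}
            ms_interp=${ms_interp##*/}
        fi
        case $ms_interp in
            sh|bash|dash|ksh|zsh) ;;
            *) printf '%s: %s\n' "$ms_file" "$ms_interp" ;;
        esac
    done
}

# Demonstration: one honest name and one that survived a rewrite in Perl.
ms_dir=$(mktemp -d)
printf '#!/bin/sh\necho hi\n' > "$ms_dir/honest.sh"
printf '#!/usr/bin/perl -w\nuse strict;\n' > "$ms_dir/rewritten.sh"
mislabeled_sh "$ms_dir"
```

Only rewritten.sh should be reported, together with the interpreter its directive actually names.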

    There are cases where scripts are executed as a result of special extensions, such as the model currently used by most webservers where file handling is cued by filename extensions. Note that even such subsystems often have other, more sophisticated approaches allowing those same extensions to be hidden, and thus protect URIs from a variant of the script filename extension problem, namely, how to keep all links to your website from breaking when you switch from *.html files to *.cgi, *.php, or something else. And of the extensions just listed, note that .html files aren't scripts, .php files use a webserver builtin, and .cgi scripts themselves require interpreter directives as well as the .cgi.

    Commands should never have filename extensions.

    Interpreter directives should always be used for scripts. Moreover, in cases where a directive is given, users should be reluctant to override it by running an interpreter on the script explicitly on the command line. A directive may well be hand-tailored to the script, using (for example) a different version of perl than the default system perl, turning the explicit override into a subtle chance for surprises, especially at sites with heterogeneous hardware, OSes, etc., where a script may depend on a network-wide version of the interpreter which differs from the ones on individual systems.
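
One way to see such a surprise is a directive that carries an option. The sketch below is an illustration only, with a hypothetical script named strict whose directive is “#!/bin/sh -e”, so that the script stops at the first failing command. Running it through an explicitly named sh turns the directive into an ordinary comment, and the -e is silently lost.

```shell
strict_dir=$(mktemp -d)
cat > "$strict_dir/strict" <<'EOF'
#!/bin/sh -e
false
echo "kept going after a failure"
EOF
chmod +x "$strict_dir/strict"

# Run as declared: the kernel starts "/bin/sh -e FILE".
declared_status=0
declared_out=$("$strict_dir/strict") || declared_status=$?

# Explicit override: plain "sh FILE", so the directive is only a comment.
override_status=0
override_out=$(sh "$strict_dir/strict") || override_status=$?

printf 'as declared: status=%s output="%s"\n' "$declared_status" "$declared_out"
printf 'overridden:  status=%s output="%s"\n' "$override_status" "$override_out"
```

Run as declared, the script should stop at false with a non-zero status and no output; under the override it carries on and prints its message, which is exactly the kind of divergence an error-suppressed context would hide.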

    So define the interpreter for your scripts using only the interpreter directive - the others lie on the path to darkness :-)

alexsiodhe