Erlkönig: Commandname Extensions Considered Harmful

Some complementary material on interpreter directives can be found in Wikipedia

Herein, a problem with filename extensions is described in a manner perhaps more pragmatic than, yet inspired by, the well-known Go To Statement Considered Harmful by Edsger W. Dijkstra (Communications of the ACM, Vol. 11, No. 3, March 1968). Dijkstra's work addresses how the use of the go to statement largely abridges the ability to parametrically describe the progress of a process, engendering an unnecessary impediment to the code's clarity and manageability. This new document details, based on practical experience under Unix-like operating systems, how filename extensions, particularly but not only on files implementing commands, create a secondary set of semantic tags in the interfaces between programs, tags which are demonstrably both superfluous and treacherous.

Consider the following example, in which a file is named with a .sh extension to indicate the type of the file and to make it easy to list all files of the same type (shell scripts).

$ ./frob.sh
hello world
$ ls *sh
frob.sh
$ sh frob.sh
frob.sh: line 2: use: command not found
hello world
$ cat frob.sh
#!/usr/bin/perl -w
use strict;
printf "hello world\n";

Such a file is typical of scripts written by relatively inexperienced users of Unix, where the code has later been reimplemented but the filename left unaltered for backwards compatibility. The surprises in the second and third commands should be self-evident, and are the focus of what follows.

Three common mechanisms, plus a composite of two of them, exist within Unix to determine how a file should be processed as a set of directives:

  • A process can elect to treat a file specially based upon a file extension. This provides a small subset of what magic numbers offer, but lets the data within the file omit any kind of recognizable in-band header. A filename is file meta-data, and can be set to disagree with the file's actual internal data.
  • For compiled binaries, the kernel uses the start of the file, consisting of a magic number and attendant data, to determine whether the file can be executed. Processes can also use magic numbers to determine whether a file might contain dynamically loadable code, as in dynamic libraries, or might contain data in a known binary format, such as images, audio data, and so forth.
  • A process can elect to arbitrarily associate a file with the run of some program, as when a shell user runs some interpreter program by name with an explicitly specified data source, such as using sh explicitly to read in and run a shell script. With most interpreters, this will cause both filename extensions and magic numbers to be ignored.
  • As a composite of the latter two of these, the special “#!” magic number (where the initial two bytes are convenient ASCII and human-readable, as well as appearing to be a comment to many interpreters) can be used to specify the appropriate interpreter program to process the file's contents. This interpreter directive contains the pathname of, and possibly arguments to, the interpreter program. These are combined with the pathname of the file and run as a new process, which is expected to open and process the file. The result is nearly identical to a user having simply entered the interpreter and filename on the command line, except that the user is freed of needing to know the interpreter, and therefore also of needing to know when the interpreter changes. Some examples of interpreter directives include:
    • #!/bin/sh
    • #!/bin/bash
    • #!/usr/bin/perl
    • #!/usr/bin/perl -w
    As well as less common applications:
    • #!/usr/bin/less (on a README file, mode 755)
    • #!/usr/bin/env perl (to apply PATH to finding perl)
    • #!/usr/bin/gnuplot (followed by plotting commands)
    • #!/bin/echo file: (try this to see the script's filename passed through)
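
The difference between letting the kernel follow the directive and naming an interpreter by hand can be seen in a small experiment. The following is only a sketch: the function name demo_directive and the file name show-me are invented for illustration, and it assumes /bin/echo exists and that the temporary directory permits execution.

```shell
# Sketch: compare kernel-selected and user-selected interpreters.
# "demo_directive" and "show-me" are hypothetical names.
demo_directive() {
    dir=$(mktemp -d) || return 1

    # The directive names /bin/echo, so running the file directly prints the
    # directive's argument followed by the script's own pathname.
    printf '%s\n' '#!/bin/echo file:' 'echo "body ran under sh"' > "$dir/show-me"
    chmod 755 "$dir/show-me"

    printf 'direct:   '; "$dir/show-me"      # the kernel honors the directive
    printf 'explicit: '; sh "$dir/show-me"   # sh sees the directive as a comment

    rm -rf "$dir"
}
```

Calling demo_directive should show the direct run handled by echo (printing the pathname) and the explicit run handled by sh (printing the body's message), which is the same split seen with frob.sh above.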

    Interpreter directives can only be changed by modifying the files' contents, whereas file extensions can be changed arbitrarily using general filesystem commands like mv. File extensions also have a disturbing tendency to get lost in some contexts, in contrast with interpreter directives, which are quite stable. With scripts, interpreter directives are typically changed in the same manner as the other contents, by using an editor of some kind, usually vi or emacs. Modern editors can usually recognize scripts by their interpreter directive, although historically special handling of certain types of text files was usually done based on the file extension. It's noteworthy that the file extension model applied almost exclusively to non-script, non-executable text files lacking interpreter directives (C files ending in .c and .h), which use extensions to trigger special handling by an application such as the C compiler rather than by the kernel.

    Now, so far, extensions might look like no more than triggers for special handling for editing sessions, or as human-readable metadata allowing easy categorization without the effort of viewing the files' contents. But there is a more insidious problem with them, in that using them breaks part of the mechanism by which the implementation details are hidden from the user, from the kernel, and from other programs which might call the script.

    Programs in Unix often start their lives as quickly written, inefficient, underfeatured shell scripts. Later, they get converted to something faster, like Perl or Python. Finally, they are often rewritten in C, C++, or something else fully compiled. If the author violates encapsulation by exposing the underlying language in a spurious extension, the command name may change from name.sh, to name.pl, to name, breaking all existing coded calls to the program each time, as well as adding to the cognitive load of human users. The more effective the user base has been at script-based factoring and reuse, the more treacherous the extensions become (i.e. proficient users often build more readily on preëxisting programs, increasing the number of dependencies on the names of those programs).

    In fact, what usually happens is that the name.sh script ends up being rewritten in something like Perl, yet with the now-misleading old name retained to keep from breaking other programs which refer to it. The resulting mismatch causes extra maintenance hassles, principally for users trying to maintain the extensions, who naïvely type things like ls -l *.sh without realizing some of the listed files aren't shell scripts anymore. Such semantic dissonance leads easily to more serious issues, with scripts called by the wrong interpreters in error-suppressed contexts, truncated processing due to the resulting errors, and the resulting arbitrarily disastrous problems.
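
    One way to find such mismatches is to compare each *.sh name against the file's own interpreter directive. The following is a minimal sketch, not an established tool: interp_of and stale_sh_names are hypothetical helpers, and the set of interpreters treated as Bourne-family shells (sh, bash, dash, ksh) is an assumption to adjust locally.

```shell
# Sketch: report *.sh files whose directive does not name a Bourne-family
# shell. "interp_of" and "stale_sh_names" are hypothetical helper names.

# Print the first word after "#!" on line 1, or nothing without a directive.
interp_of() {
    head -n 1 "$1" | sed -n 's|^#![[:space:]]*\([^[:space:]]*\).*|\1|p'
}

# List "file: interpreter" for each misleadingly named script in a directory
# (default: the current directory).
stale_sh_names() {
    for f in "${1:-.}"/*.sh; do
        [ -f "$f" ] || continue
        interp=$(interp_of "$f")
        case $interp in
            */sh|*/bash|*/dash|*/ksh|'') ;;   # a shell, or no directive at all
            *) printf '%s: %s\n' "$f" "$interp" ;;
        esac
    done
}
```

    Run in the directory holding the frob.sh example above, stale_sh_names would report that frob.sh is handled by /usr/bin/perl. Files without any directive are skipped, since nothing in them contradicts the name.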

    Users want to be able to list, for example, all Bourne shell scripts easily with ls(1), and this is a big motivation for naming them all with .sh extensions. If ls(1) had an option to filter based on the execution method of a file, say something like
    ls -e '*/sh'
    to list only files with /sh at the end of the first part of the interpreter directive, that would help. However, whether ls(1) should even be doing such a job would probably be hotly, justifiably contested.
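
    Until some such option exists, the same filtering can be approximated outside ls(1). The sketch below is an illustration only: ls_by_interp is an invented name, not an existing command, and it matches a shell glob against the first word of the interpreter directive, so a script starting with #!/usr/bin/env perl is seen as using /usr/bin/env.

```shell
# Sketch of a stand-in for the hypothetical "ls -e PATTERN".
# "ls_by_interp" is an invented name, not an existing command.
# usage: ls_by_interp PATTERN FILE...
ls_by_interp() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        # First word after "#!" on line 1; empty when there is no directive.
        interp=$(head -n 1 "$f" | sed -n 's|^#![[:space:]]*\([^[:space:]]*\).*|\1|p')
        case $interp in
            $pattern) printf '%s\n' "$f" ;;   # unquoted, so it acts as a glob
        esac
    done
}
```

    For example, ls_by_interp '*/sh' * lists the files in the current directory whose directive's first part ends in /sh, whatever those files happen to be called.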

    The issue of using the wrong interpreter can be subtle, since a user seeing a name.py program may enter python name.py, not realizing that the program only works with python 2.5 when 2.4 is still the system default (the former would have a directive like #!/usr/bin/python2.5). Scripts also often make delicate use of interpreter directives to have the PATH used or ignored, or special options passed in.

    There are cases where scripts are executed as a result of special extensions, such as the model currently used by most webservers, where file handling is cued by filename extensions. However, even such subsystems often have other, more sophisticated approaches allowing those same extensions to be hidden, and thus protect URIs from a variant of the script filename extension problem, namely, how to keep all links to your website from breaking when you switch from *.html files to *.cgi, *.php, or something else. Furthermore, of the extensions just listed, note that .html files aren't scripts, .php files use a webserver builtin, and .cgi scripts themselves require interpreter directives as well as the .cgi.

    Commands should never have filename extensions.

    Rely instead on interpreter directives, or on some other paradigm that prevents the implementation from being exposed, or worse yet, lied about, within the very name of the command.

alexsiodhe, christopher north-keys, christopher alex north-keys