Erlkönig: assert() == epic fail

The Case Against `assert()`

Modern use of assert() is as part of formal unit testing, which generally segregates the tests to code outside of that which will be shipped and runs them through automated testing. In this model, all of the assertion code is in the unit tests, and none in the primary source code.

Historically, the primary usage model for assert() was to bespeckle code everywhere with sanity checks, enable them all during testing, where triggering any of them will immediately produce a specific bug for development to fix, and then disable them when the software is shipped to the customer. This model isn't perfect, since it completely ignores what happens to customers who would have triggered the asserts, had they been enabled. The secondary usage model for assert(), either through intent to address this issue or often through simple negligence, was to just leave all the assert()s enabled in the code shipped to the customer.

The problem with the second historical approach above is that essentially every use of assert() identifies a place in the code where a problem was suspected but never corrected. Instead, the problem has been exacerbated by explicitly forcing a fatal crash, losing all user data, context, and anything else that was implicit in the program's state at the time.

An assert() in customer-facing, stability-sensitive code is an epic failure. If your company has developed a smartphone application, a game, or nearly any client application to a remote server, any code that unconditionally converts detected weirdness into an abrupt and fatal software crash is a failure of both the development process and the underlying philosophy which guided it.

assert()s do belong in certain code outside of regression tests, such as computational code where the slightest hint of failure means that killing the application is safer than believing the suspect results. Applications like this are often characterized as managing authoritative data, financial or user account information, or having direct impact on the real world in manufacturing or other fields where software error can have significant real-world consequences.

However, in many applications utter correctness is not mandatory, but the customer experience may demand application stability over utter correctness. This is particularly true where the output data is intended for human sensory consumption, such as sound or rendering software, rather than consumption by other programs. It's also true of applications which have an unacceptably long start time, real-time deadlines that must be met (MMORPGs, stock trading, machine control, etc.), or in markets where program crashes unacceptably erode product acceptance.

Thoughts about runtime assertions

This section doesn't attempt to specify a final solution to runtime assertions, but merely offers some thoughts on how they could be made useful to end users, instead of simply being a set of logic bombs that make pre- and post-conditions more obvious in debugging and testing. Nor do these thoughts cover every variety of assert, since some are evaluated in the compile stage, or are intended to capture the stack as authentically as possible. However, one possible direction for improving assertions in software that values stability over hardcore correctness suggests that anomalies should:

Be Logged

Log the errors, preferably with a way of ensuring that the logs have a good chance of making it back to development. Most importantly, without a log there may be no text version of the error at all, such as the one the classic assert() produces.
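
As a rough illustration of the logging step, here is a minimal sketch in C. The names sane_format() and sane_log(), the record layout, and the idea of appending to a caller-supplied log path are all assumptions made for this example, not an established API. How the log gets back to development is left open.

```c
#include <stdio.h>

/* Hypothetical helper: format one anomaly record into buf.
 * Returns whatever snprintf() returns. */
int sane_format(char *buf, size_t size,
                const char *file, int line, const char *msg)
{
    return snprintf(buf, size, "%s:%d: anomaly: %s", file, line, msg);
}

/* Hypothetical helper: append one record to the log at log_path.
 * Returns 0 on success, -1 if the log could not be opened.
 * A logging failure is reported to the caller, never turned into a crash. */
int sane_log(const char *log_path,
             const char *file, int line, const char *msg)
{
    char buf[512];
    FILE *fp = fopen(log_path, "a");

    if (fp == NULL)
        return -1;
    sane_format(buf, sizeof buf, file, line, msg);
    fprintf(fp, "%s\n", buf);
    fclose(fp);
    return 0;
}
```

A call such as sane_log("app.log", __FILE__, __LINE__, "texture missing") would then leave a text record behind while the program keeps running.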

Be characterized with a severity

Many developers will emplace the suicidal logic bomb that is assert() at the slightest provocation. If tagged additionally with something like the usual severity levels, { debug, info, notice, warn, err, crit, alert, emerg }, the method of informing the user can be appropriately configured from one place, and the program kept running to allow the user an opportunity to continue without data or experiential loss. Note that this is most suitable for programs which are not authoritative, and whose ongoing insanity wouldn't actually damage anything.
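
The "configured from one place" idea might look something like the following sketch. The enum names, the three notification styles, and the two threshold variables are hypothetical. Only the list of severity levels comes from the text above.

```c
/* Hypothetical severity levels, in the order listed above. */
typedef enum {
    SEV_DEBUG, SEV_INFO, SEV_NOTICE, SEV_WARN,
    SEV_ERR, SEV_CRIT, SEV_ALERT, SEV_EMERG
} severity_t;

/* Hypothetical ways of surfacing an anomaly to the user. */
typedef enum {
    NOTIFY_LOG_ONLY,    /* record it, do not interrupt the user */
    NOTIFY_STATUS_BAR,  /* unobtrusive on-screen hint */
    NOTIFY_DIALOG       /* interrupt the user with a message */
} notify_t;

/* The one place where policy lives; an application could load
 * these thresholds from its configuration instead. */
static severity_t status_bar_floor = SEV_WARN;
static severity_t dialog_floor = SEV_CRIT;

/* Map a severity to a notification style. Nothing here terminates
 * the program, whatever the severity. */
notify_t notify_for(severity_t sev)
{
    if (sev >= dialog_floor)
        return NOTIFY_DIALOG;
    if (sev >= status_bar_floor)
        return NOTIFY_STATUS_BAR;
    return NOTIFY_LOG_ONLY;
}
```

Changing how loudly the program complains then means editing two thresholds rather than hunting down every check.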

Be [optionally] characterized with a longevity

Transient errors such as failures to obtain rendering data from a remote server for a game world object may be corrected automatically in the next render pass. Failure to save a file for a user can be corrected by the user, if properly notified. In some cases a simple retry of an operation within the program may be enough. Some errors can be corrected from outside of the scope of the program, usually meaning the program itself cannot reliably determine retryability, as in cases of resource exhaustion (memory, disk, etc). Finally, although perhaps not exhaustively, we have permanent errors, where the code can determine that something important is no longer in a defined state, and the architecture isn't capable of correcting it - a case where a restart may be required. All of these gradations could be helpful in determining how or if the user should be informed of the problem. In many cases the longevity component can be folded into the severity - a graphics error's severity scales with the span of time over which it's visible - in other cases it may only inform how the surrounding code should be rewritten.
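
The gradations above could be captured in a small tag alongside the severity. The sketch below is only one possible reading: the enum names and the mapping from longevity to response are assumptions, and a real program would likely want a finer-grained policy.

```c
/* Hypothetical longevity tags, following the gradations described above. */
typedef enum {
    LONGEVITY_TRANSIENT,    /* may clear by itself, e.g. on the next render pass */
    LONGEVITY_RETRYABLE,    /* the program can simply retry the operation */
    LONGEVITY_USER_FIXABLE, /* the user can correct it, e.g. a failed save */
    LONGEVITY_EXTERNAL,     /* fixable only from outside, e.g. disk full */
    LONGEVITY_PERMANENT     /* state is undefined; a restart may be required */
} longevity_t;

/* Hypothetical responses the surrounding code might take. */
typedef enum {
    RESPOND_LOG_AND_RETRY,
    RESPOND_TELL_USER,
    RESPOND_SUGGEST_RESTART
} response_t;

/* Decide how, or whether, the user should hear about the problem. */
response_t respond_to(longevity_t how_long)
{
    switch (how_long) {
    case LONGEVITY_TRANSIENT:
    case LONGEVITY_RETRYABLE:
        return RESPOND_LOG_AND_RETRY;
    case LONGEVITY_USER_FIXABLE:
    case LONGEVITY_EXTERNAL:
        return RESPOND_TELL_USER;
    case LONGEVITY_PERMANENT:
        return RESPOND_SUGGEST_RESTART;
    }
    return RESPOND_TELL_USER;  /* unknown tag: err on the side of informing */
}
```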

Be made known to the user

Inform the user, particularly if the user needs to restart the application. In many cases, a user may be able to continue, perhaps with inconsistent graphics or some other degraded aspect of the experience, but with the opportunity to modify goals with an eye to continuing, or to save data and shut down gracefully.

Have response delegated to the user

The key principle here is that if the software is in an undefined state, it is also likely that the software cannot reliably determine whether it's necessary to terminate. Providing a way for the user to interact with the ailing program's decision to continue or terminate may be critical to the user to avoid frustration, loss, or actual damages. Often, the simple programming choice to avoid calling exit() is enough to give the user a chance to salvage data, etc.
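
One way to hand the decision to the user is a callback that the application's UI layer installs. Everything named below (user_choice_t, ask_user_fn, set_ask_user(), handle_anomaly()) is hypothetical, a sketch of the idea rather than a recommended interface.

```c
#include <stddef.h>

/* Hypothetical choices offered to the user. */
typedef enum {
    CHOICE_CONTINUE,
    CHOICE_SAVE_AND_QUIT,
    CHOICE_QUIT_NOW
} user_choice_t;

/* The UI layer supplies a function that shows the description
 * and returns what the user picked. */
typedef user_choice_t (*ask_user_fn)(const char *description);

/* Fallback when no UI is installed: keep running, never exit(). */
user_choice_t ask_nobody(const char *description)
{
    (void)description;
    return CHOICE_CONTINUE;
}

static ask_user_fn ask_user = ask_nobody;

void set_ask_user(ask_user_fn fn)
{
    ask_user = (fn != NULL) ? fn : ask_nobody;
}

/* Returns nonzero if the program should keep running. The caller,
 * not this function, performs any orderly shutdown. */
int handle_anomaly(const char *description, void (*save_data)(void))
{
    switch (ask_user(description)) {
    case CHOICE_SAVE_AND_QUIT:
        if (save_data != NULL)
            save_data();
        return 0;
    case CHOICE_QUIT_NOW:
        return 0;
    case CHOICE_CONTINUE:
    default:
        return 1;
    }
}
```

The default deliberately errs toward continuing, in line with the point that simply not calling exit() is often enough.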

Be reported in as much context as possible

Errors within a subsystem of a program can in some cases be handled by resetting part of the data model rather than demanding a restart. Usually an assert() would have meant that the programmer didn't see an obvious solution, but an informed user may be able to correct the one flawed section of the data model without needing to reload everything or restart. Providing feedback to the user with as much context as possible is generally better than the classic assert() info of file, line number, and cryptic text. Hiding any of the information may reduce the user's ability to address an error which the programmer manifestly failed to address, so erring on the side of more information rather than less is suggested. Note that the same data may also shorten debugging time for developers, by providing part of the context a developer would normally need to reconstruct by reading the code. The simplest approach is to use more mature assertion styles, less like C's legacy default:

assert(a == b)

and more like Python's (note: intended specifically for unit tests):

assertLessEqual(a, b, msg="explain why a should be equal to or less than b")

such as:

SaneIfLessEqual(x, xmax, SError_Transient,
                "x coörd should be in max bound in dialbox input #%d", dial_number)

SaneIfEqual(bytes_to_write, bytes_written, SAlert,
            "bytes written should match expected for file %s", filename)

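The SaneIf... calls above might be built as thin macros over a reporting function. The following is a minimal sketch, not a definitive implementation. The sane_level_t enum, the sane_report() function, and the sane_failures counter are assumptions made for illustration; only the macro names and the SError_Transient and SAlert tags come from the examples above.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical level tags; the full set and its order are assumptions. */
typedef enum {
    SDebug, SInfo, SNotice, SWarn,
    SError_Transient, SError, SCrit, SAlert, SEmerg
} sane_level_t;

static int sane_failures = 0;  /* how many checks have failed so far */

/* Report a failed check with its location, the expression text,
 * and a printf-style explanation. Reports, never terminates. */
static void sane_report(const char *file, int line, const char *expr,
                        sane_level_t level, const char *fmt, ...)
{
    va_list ap;

    sane_failures++;
    fprintf(stderr, "%s:%d: sanity check failed (%s) [level %d]: ",
            file, line, expr, (int)level);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}

/* Each macro yields 1 if the condition held and 0 if it did not, so the
 * caller can choose how to recover. Arguments are evaluated once. */
#define SaneIfLessEqual(a, b, level, ...)                         \
    ((a) <= (b) ? 1                                               \
                : (sane_report(__FILE__, __LINE__, #a " <= " #b,  \
                               (level), __VA_ARGS__), 0))

#define SaneIfEqual(a, b, level, ...)                             \
    ((a) == (b) ? 1                                               \
                : (sane_report(__FILE__, __LINE__, #a " == " #b,  \
                               (level), __VA_ARGS__), 0))
```

Because each macro is an expression, a caller can write something like "if (!SaneIfLessEqual(x, xmax, ...)) x = xmax;" and carry on with a clamped value instead of crashing.
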
Be added to regression tests instead when possible

It's far better to catch a problem without a user involved; use regression tests as the primary mechanism, and use the reported assertion failures to identify how to extend the regression tests.

Have run-time selectable abort/coredump behavior

The assert() in the development cycle can be used to drop into a containing debugger (such as gdb), or create a core dump. The former is trivially available using the debugger's internal directives to halt whenever the SaneIf... wrapper is entered, while still keeping the full stack above it, although extra function calls could further tangle a corrupted stack. A core dump could be selected interactively by the user, although subject to the same limitation.
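
A run-time switch for the abort behavior could be as simple as an environment variable consulted inside a single hook that every failed check passes through. The hook name sane_fail_hook() and the variable name SANE_ABORT are made up for this sketch. getenv() and abort() are standard C.

```c
#include <stdlib.h>
#include <string.h>

/* Interpret a setting string: unset, empty, or "0" means keep running;
 * anything else means abort. */
int sane_abort_requested(const char *value)
{
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Hypothetical hook called on every failed check. In gdb, a breakpoint
 * set with "break sane_fail_hook" halts here with the caller's stack
 * still above it. */
void sane_fail_hook(void)
{
    /* SANE_ABORT is an assumed variable name, read at failure time so
     * the behavior can be chosen per run without rebuilding. */
    if (sane_abort_requested(getenv("SANE_ABORT")))
        abort();  /* raises SIGABRT; a core dump follows only if the system allows it */
}
```

An interactive program could equally set the flag from a menu or dialog rather than the environment.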

Still be excludable by compile-time options

It's trivial to construct these as macros to allow for their omission in production code, especially for onerously compute-intensive assertions. However, with CPU power being so readily available, retaining them as something of a user warning system and safety net isn't unreasonable.
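
A compile-time switch for such macros could be sketched as follows. SANE_DISABLE, SaneIf(), and sane_check() are hypothetical names chosen for the example. The mechanism mirrors the way the standard assert() is removed when NDEBUG is defined.

```c
#include <stdio.h>

static int sane_checks_run = 0;  /* counts checks that were actually evaluated */

/* Report a failed check and hand the result back to the caller. */
static int sane_check(int ok, const char *expr, const char *file, int line)
{
    sane_checks_run++;
    if (!ok)
        fprintf(stderr, "%s:%d: sanity check failed: %s\n", file, line, expr);
    return ok;
}

#ifdef SANE_DISABLE
/* Compiled out: the condition is not evaluated at all, so it must not
 * carry side effects the program depends on. */
#define SaneIf(cond) (1)
#else
#define SaneIf(cond) sane_check((cond) != 0, #cond, __FILE__, __LINE__)
#endif
```

Building with -DSANE_DISABLE would drop the checks. The default build keeps them as the warning system and safety net described above.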