Five definitions of reliability

A large part of my day job revolves around reliability—testing software reliability, testing operating system features that make the software running on it more reliable, working on test tools that look for reliability problems. I think about reliability a lot. And here’s the amazing conclusion I’ve come to:

No one knows what reliability is.

Or, to put it more accurately, no one agrees on the definition of reliability as it applies to software. Is it

  1. Predictable behavior? The software should just work—it should respond in the way I expect within a consistent timeframe. This also implies that at a bare minimum it should never crash unexpectedly or corrupt the data I’m working on.
  2. Durability? The software should require very little maintenance over time—or perhaps none at all.
  3. Fault tolerance? The software should continue running even in the presence of multiple unexpected errors.
  4. Diagnosis and repair? If the software does fail for some reason, it should provide convenient tools or resources for recovering from that failure and resuming work.
  5. Efficient use of resources? The software should not use resources in a way that negatively impacts overall system performance.

At different times I’ve spoken with individuals who each valued one of these characterizations of reliability over the others. I think there is a case to be made for each of them.

Generalizing from this multiplicity illuminates what I think is a key aspect of reliability as a quality: its definition is often domain-specific. Take the first definition above, for example. Who can say what constitutes predictable behavior without also discussing the particulars of the behavior that is expected?

For a tester who works on system reliability in general, rather than on reliability within the context of a specific component, this presents a dilemma: how can one generically test the reliability of a system if reliability is inherently domain-specific?

I’ll end with some of the open-ended questions I’ve been thinking and writing about as I attempt to resolve this dilemma and make progress in my work:

  1. How can we model any of the definitions above?
  2. Can we model the environmental factors that induce the behavior we wish to observe?
  3. How can we measure the response in a useful, quantitative manner?
  4. What is our failure model? Is it too coarse? Too fine-grained?
  5. What effect does our failure model have on the measurements we make?
  6. How can we look for these concerns when the code is still at the design stage?
  7. What would be a useful test tool for each of the reliability definitions above? Does it have to be domain-specific or is it possible to think of something generic?
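
To make questions 3 through 5 a little more concrete, here is a minimal sketch of one possible generic harness, assuming a deliberately coarse failure model: each I/O call fails independently with a fixed probability. Every name in it (`FailureModel`, `run_trials`, `fragile`, `tolerant`) is hypothetical and invented for illustration. It covers only the fault-tolerance definition, and the number it reports depends entirely on the failure model chosen.

```python
import random
from dataclasses import dataclass
from typing import Callable


class InjectedFault(Exception):
    """Raised by the harness to simulate an environmental failure."""


@dataclass
class FailureModel:
    # Assumption: each I/O call fails independently with this probability.
    fault_probability: float


@dataclass
class Report:
    trials: int
    faults_injected: int
    failures_observed: int

    @property
    def success_rate(self) -> float:
        return 1.0 - self.failures_observed / self.trials


def run_trials(operation: Callable[[Callable[[], None]], None],
               model: FailureModel, trials: int, seed: int = 0) -> Report:
    """Run `operation` repeatedly, handing it an I/O hook that may fail."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    rng = random.Random(seed)  # seeded so that runs are repeatable
    injected = 0
    failures = 0

    def maybe_fault() -> None:
        nonlocal injected
        if rng.random() < model.fault_probability:
            injected += 1
            raise InjectedFault()

    for _ in range(trials):
        try:
            operation(maybe_fault)
        except Exception:
            failures += 1  # any escaped exception counts as a failed trial
    return Report(trials, injected, failures)


def fragile(io: Callable[[], None]) -> None:
    io()  # no error handling: every injected fault becomes a failure


def tolerant(io: Callable[[], None], retries: int = 3) -> None:
    for attempt in range(retries):
        try:
            io()
            return
        except InjectedFault:
            if attempt == retries - 1:
                raise  # give up after the last retry
```

Calling `run_trials(fragile, FailureModel(0.3), 1000)` and `run_trials(tolerant, FailureModel(0.3), 1000)` and comparing the two `success_rate` values gives a quantitative response to the same injected conditions. Changing the failure model, for example to correlated or bursty faults, would change the measurement, which is exactly the concern behind question 5.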
