Should we design programs to randomly kill themselves? | Ars Technica

Redundancy is good, so why not force it?

This Q&A is part of a weekly series of posts highlighting common questions encountered by technophiles and answered by users at Stack Exchange, a free, community-powered network of 100+ Q&A sites.

Jimbojw asks:

Should we design death into our programs, processes, and threads at a low level, for the good of the overall system?

Failures happen. Processes die. We plan for disaster and occasionally recover from it. But we rarely design and implement unpredictable program death. We hope that our services' uptimes are as long as we care to keep them running.

A macro-example of this concept is Netflix's Chaos Monkey, which randomly terminates AWS instances in some scenarios. They claim that this has helped them discover problems and build more redundant systems.

What I'm talking about is lower level. The idea is for traditionally long-running processes to randomly exit. This should force redundancy into the design and ultimately produce more resilient systems.

Does this concept already have a name? Is it already being used in the industry?

See the original question here.

Avoid at all costs

Telastyn answers (45 votes):

No.

We should design proper bad-path handling and design test cases (and other process improvements) to validate that programs handle these exceptional conditions well. Stuff like Chaos Monkey can be part of that, but as soon as you make "must randomly crash" a requirement, actual random crashes become things testers cannot file as bugs.

Related: "Should I intentionally break the build when a bug is found in production?"

No exit (code)

Kaz answers (3 votes):

Adding random exit code to the application should not be necessary. Testers can write scripts which randomly kill the application's processes.
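As a rough illustration of the external approach described above, a tester's script might look something like the following sketch. The function names (`pick_victim`, `kill_abruptly`, `chaos_round`) are hypothetical and not taken from any real tool; the script assumes the tester already knows the process IDs of the application under test.

```python
import os
import random
import signal
import time


def pick_victim(pids, rng=random):
    """Choose one PID at random from the candidates, or None if there are none."""
    pids = list(pids)
    return rng.choice(pids) if pids else None


def kill_abruptly(pid):
    """Kill the target without giving it a chance to clean up."""
    # SIGKILL is not defined on Windows; fall back to SIGTERM there,
    # which os.kill implements as an immediate process termination.
    os.kill(pid, getattr(signal, "SIGKILL", signal.SIGTERM))


def chaos_round(pids, max_delay_seconds=0.0, rng=random):
    """Wait a random interval, then abruptly kill one randomly chosen PID."""
    victim = pick_victim(pids, rng)
    if victim is None:
        return None
    time.sleep(rng.uniform(0.0, max_delay_seconds))
    kill_abruptly(victim)
    return victim
```

The application itself contains no failure code here; the killing lives entirely in the test harness, which could call `chaos_round` in a loop against a test deployment.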

In networking, it is necessary to simulate an unreliable network for the sake of testing a protocol implementation. This does not get built into the protocol; it can be simulated at the device driver level, or with some external hardware.
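The same idea can be sketched in software: the unreliability sits in a wrapper around the transport, not in the protocol code. The `LossyChannel` class below is a hypothetical, simplified stand-in for the driver-level or hardware simulation mentioned above, and it models only message loss.

```python
import random


class LossyChannel:
    """Wraps a delivery function and randomly drops messages before they arrive."""

    def __init__(self, deliver, drop_rate, rng=None):
        if not 0.0 <= drop_rate <= 1.0:
            raise ValueError("drop_rate must be between 0.0 and 1.0")
        self._deliver = deliver
        self._drop_rate = drop_rate
        self._rng = rng or random.Random()
        self.dropped = 0

    def send(self, message):
        """Deliver the message unless the simulated network loses it."""
        if self._rng.random() < self._drop_rate:
            self.dropped += 1
            return False
        self._deliver(message)
        return True
```

A protocol implementation under test would be handed a `LossyChannel` in place of its real transport, so the production code path stays unchanged.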

Don't add test code to the program for situations that can be achieved externally.

If this is intended for production, I can't believe it's serious!

First, unless the processes exit abruptly, so that in-progress transactions and volatile data are lost, it's not an honest implementation of the concept.

Planned, graceful exits, even if randomly timed, do not adequately help prepare the architecture for dealing with real crashes, which are not graceful.
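The difference between a graceful and an abrupt exit can be demonstrated with a small experiment. This is a POSIX-only sketch under stated assumptions: the helper name `cleanup_ran_after` and the marker file are invented for illustration. A child process installs a SIGTERM handler that records its cleanup, and the parent checks whether that cleanup ran after sending a given signal.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Source for the child process: it records its cleanup only if it is
# given the chance to handle the signal.
CHILD_SOURCE = """
import signal, sys, time

marker_path = sys.argv[1]

def on_sigterm(signum, frame):
    # Graceful path: the process gets to write its cleanup marker.
    with open(marker_path, "w") as marker:
        marker.write("cleaned up")
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
print("ready", flush=True)
time.sleep(60)
"""


def cleanup_ran_after(sig):
    """Start a child, send it `sig`, and report whether its cleanup handler ran."""
    with tempfile.TemporaryDirectory() as workdir:
        marker_path = os.path.join(workdir, "cleanup.marker")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, marker_path],
            stdout=subprocess.PIPE,
            text=True,
        )
        child.stdout.readline()  # Block until the child has installed its handler.
        child.send_signal(sig)
        child.wait(timeout=10)
        child.stdout.close()
        return os.path.exists(marker_path)
```

Calling `cleanup_ran_after(signal.SIGTERM)` exercises the graceful path, while `cleanup_ran_after(signal.SIGKILL)` models a real crash: SIGKILL cannot be caught, so the handler never runs.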

If real or realistic malfunctions are built into the application they could result in economic harm, just like real malfunctions. Purposeful economic harm is basically a criminal act, almost by definition.

You may be able to get away with clauses in the licensing agreement which waive civil liability from any damages arising from the operation of the software, but if those damages are by design, you might not be able to waive criminal liability.

Don't even think about stunts like this: make it work as reliably as you can and put fake failure scenarios only into special builds or configurations.
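One way to keep fake failures confined to special configurations is to gate them behind a setting that defaults to off. The sketch below assumes a hypothetical environment variable, `FAULT_INJECTION_RATE`, and hypothetical helper names; it shows one possible arrangement and is not an established convention.

```python
import os
import random


class InjectedFault(RuntimeError):
    """Raised only when fault injection has been explicitly enabled."""


def fault_rate_from_env(environ=os.environ):
    """Read the hypothetical FAULT_INJECTION_RATE setting; default to 0.0 (off)."""
    raw = environ.get("FAULT_INJECTION_RATE", "0")
    try:
        rate = float(raw)
    except ValueError:
        return 0.0  # Unparseable values leave injection disabled.
    return min(max(rate, 0.0), 1.0)  # Clamp to the range [0.0, 1.0].


def maybe_fail(rate, rng=random):
    """Raise InjectedFault with probability `rate`; do nothing when rate is 0."""
    if rng.random() < rate:
        raise InjectedFault("simulated failure (test configuration only)")
```

With the variable unset, as it would be in production, `maybe_fail(fault_rate_from_env())` never raises; a test configuration opts in by setting a nonzero rate.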

The code that cried wolf

Prunge answers (6 votes):

The problem I see is that if such a program dies, we'll just say "Oh it's just another random termination—nothing to worry about." But what if there is a real problem that needs fixing? It will get ignored.

Programs already "randomly" fail due to developers making mistakes, bugs making it into production systems, hardware failures, etc. When this does occur, we want to know about it so we can fix it. Designing death into programs only increases the probability of failure and would only force us to increase redundancy, which costs money.

I see nothing wrong with killing processes randomly in a test environment when testing a redundant system (this should be happening more than it is) but not in a production environment. Would we pull a couple of hard drives out from a live production system every few days, or deactivate one of the computers on an aircraft as it is flying full of passengers? In a testing scenario, fine. In a live production scenario, I'd rather not.

Find more answers or leave your own at the original post. See more Q&A like this at Programmers, a site for conceptual programming questions at Stack Exchange. And of course, feel free to log in and ask a question for yourself.
