I think most users expect Netflix to be up 100% of the time. To make that possible, Netflix has taken a rather extreme approach to satisfying itself that it can handle server errors.
They've implemented what is called a Chaos Monkey. It randomly throws a tantrum somewhere in their system, and they then see how that affects performance as well as other things like security.
The article is a bit technical but a good read, and they are very open with the results.
With Spark Streaming as our choice of stream processor, we set out to evaluate and share the resiliency story for Spark Streaming in the AWS cloud environment. A Chaos Monkey based approach, which randomly terminated instances or processes, was employed to simulate failures.
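
To give a feel for the idea, here is a toy sketch of "randomly terminate an instance and see if the system still copes". It is not Netflix's tool or anything from the article: the `Instance` class, the `terminate_random` and `healthy` helpers, the worker names and the threshold are all made up for illustration, and nothing here touches real AWS instances.

```python
import random


class Instance:
    """Hypothetical stand-in for a server or worker process."""

    def __init__(self, name):
        self.name = name
        self.running = True


def terminate_random(instances, rng):
    """Stop one randomly chosen running instance.

    Returns the instance that was stopped, or None if nothing was running.
    """
    running = [inst for inst in instances if inst.running]
    if not running:
        return None
    victim = rng.choice(running)
    victim.running = False
    return victim


def healthy(instances, min_running):
    """Assumed health rule: at least `min_running` instances are still up."""
    return sum(inst.running for inst in instances) >= min_running


# A pretend cluster of five workers (names are illustrative only).
rng = random.Random(42)
cluster = [Instance(f"worker-{n}") for n in range(5)]

victim = terminate_random(cluster, rng)
print(f"terminated {victim.name}, still healthy: {healthy(cluster, 3)}")
```

In a real setup the interesting part is what happens after the kill: whether the job recovers, how long it takes, and whether any data is lost.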