Step 1: Availability metrics
Traditionally, percentage of time system is up
- time-averaged, binary view of system state (up/down)
This metric is inflexible
- doesn’t capture degraded states: there is a non-binary spectrum between “up” and “down”
- time-averaging discards important temporal behavior
- compare 2 systems with 96.7% traditional availability:
  - system A is down for 2 seconds per minute
  - system B is down for 1 day per month
Our solution: measure variation in system quality-of-service metrics over time
- performance, fault-tolerance, completeness, accuracy
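
The comparison of systems A and B can be checked with a small sketch. It first computes the traditional time-averaged availability for both systems, then looks at how availability varies over time in fixed windows. Everything here is an illustrative assumption rather than the authors' method: the function names, the 30-day month, the one-hour window, and the use of per-window uptime as a stand-in for a quality-of-service metric (the slides list performance, fault-tolerance, completeness, and accuracy).

```python
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR
MONTH = 30 * DAY  # assumption: a 30-day month


def traditional_availability(outages, period):
    """Time-averaged availability: fraction of the period the system is up."""
    down = sum(end - start for start, end in outages)
    return 1 - down / period


def windowed_availability(outages, period, window):
    """Availability computed separately for each window of the period."""
    n = period // window
    down = [0] * n
    for start, end in outages:
        w = start // window
        # Spread this outage's downtime over every window it overlaps.
        while w < n and w * window < end:
            lo = max(start, w * window)
            hi = min(end, (w + 1) * window)
            down[w] += hi - lo
            w += 1
    return [1 - d / window for d in down]


# Outages as (start_second, end_second) intervals over one month.
# System A: down for the last 2 seconds of every minute.
outages_a = [(m * MINUTE + 58, m * MINUTE + 60) for m in range(MONTH // MINUTE)]
# System B: down for one whole day (placed at the start of the month).
outages_b = [(0, DAY)]

avail_a = traditional_availability(outages_a, MONTH)
avail_b = traditional_availability(outages_b, MONTH)

hourly_a = windowed_availability(outages_a, MONTH, HOUR)
hourly_b = windowed_availability(outages_b, MONTH, HOUR)

longest_a = max(end - start for start, end in outages_a)
longest_b = max(end - start for start, end in outages_b)

print(f"traditional availability: A={avail_a:.1%}  B={avail_b:.1%}")
print(f"worst hourly availability: A={min(hourly_a):.1%}  B={min(hourly_b):.1%}")
print(f"longest outage (seconds): A={longest_a}  B={longest_b}")
```

Both systems accumulate the same total downtime, so the time-averaged number cannot tell them apart. The per-window view and the longest-outage figure do: system A stays close to its average in every hour, while system B has whole hours with no service at all.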