Managing IT infrastructure is complex and costly because of the many variation points at all levels of a system stack (hardware, network, middleware, application, etc.). Even with the best calibration of the system configuration, incidents still happen due to unforeseen loads, hardware defects, application errors, and similar causes.
To minimize service outages from incidents, variables on all levels of the system stack (disk usage, database hit ratio, etc.) are continuously monitored to detect the buildup of incidents as early as possible. Furthermore, to minimize both the time until an incident is resolved and its cost, incident responses such as system clean-up, component restart, and reconfiguration are partially automated.
To support incident response automation at scale, we are also developing an optimized representation of the diagnosis and remediation process that makes up an incident response, one that is amenable to both human engineering and machine learning. This process representation features optimized decision trees and automated planners.
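As a rough illustration of a decision-tree process representation, the following minimal Python sketch walks a small diagnosis tree over monitored variables and returns a remediation action. It is not the representation developed in this work: the names `Action`, `Decision`, `diagnose`, and `TREE`, the metric keys, and the thresholds (90% disk usage, 80% database hit ratio) are all hypothetical, chosen only to mirror the monitored variables and responses mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# Monitored variables, keyed by a (hypothetical) metric name.
Metrics = Dict[str, float]


@dataclass(frozen=True)
class Action:
    """Leaf node: the remediation to carry out."""
    name: str


@dataclass(frozen=True)
class Decision:
    """Inner node: a diagnostic test with one branch per outcome."""
    description: str
    test: Callable[[Metrics], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Action, Decision]


def diagnose(node: Node, metrics: Metrics) -> str:
    """Follow diagnostic tests from the root until an action is reached."""
    while isinstance(node, Decision):
        node = node.if_true if node.test(metrics) else node.if_false
    return node.name


# Illustrative tree; thresholds are assumptions, not values from the text.
TREE: Node = Decision(
    "disk usage above 90%?",
    lambda m: m["disk_usage"] > 0.90,
    Action("clean_up"),
    Decision(
        "database hit ratio below 80%?",
        lambda m: m["db_hit_ratio"] < 0.80,
        Action("reconfigure"),
        Action("no_action"),
    ),
)

if __name__ == "__main__":
    print(diagnose(TREE, {"disk_usage": 0.95, "db_hit_ratio": 0.99}))
```

Keeping the tree as plain data rather than nested `if` statements is one way such a representation could remain readable for engineers while still being something a learning procedure can generate or reorder.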