Polygraph: System for Dynamic Reduction of False Alerts in Large-Scale IT Service Delivery Environments

Today’s large-scale IT service delivery systems encompass multiple data centers, geographical locations, and diverse hardware and software platforms. Services are no longer confined to racks within a single data center; they are often deployed and served from multiple locations. Further, with the increasing adoption of virtualization and cloud computing, the management of large-scale IT infrastructure is increasingly the focus of data center optimization and innovation. Among service management tasks such as incident, problem, change, and patch management, incident handling often constitutes a major portion of the work performed by the system administrators who manage the system components. In this paper, we focus our attention on monitoring alerts, which are triggered by agents that monitor the health of system components against pre-set thresholds. A fraction of these alerts are converted into service tickets that must be investigated and resolved within a specified time. Our data, collected from a very large IT service environment, as well as previous studies, indicate that a significant portion of these incidents can be false, often as high as half the total alert volume, resulting in wasted work investigating them.
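To make the alerting mechanism concrete, the following sketch shows how a monitoring agent might evaluate a metric against a pre-set threshold. The `MonitoringPolicy` class and the `duration` parameter are illustrative assumptions, not part of the Polygraph system as described.

```python
from dataclasses import dataclass

@dataclass
class MonitoringPolicy:
    metric: str       # e.g. CPU utilization percentage
    threshold: float  # pre-set alerting threshold
    duration: int     # consecutive samples above threshold before alerting

def should_alert(samples, policy):
    """Raise an alert when the last `policy.duration` samples of the
    monitored metric all exceed the policy's threshold."""
    if len(samples) < policy.duration:
        return False
    return all(s > policy.threshold for s in samples[-policy.duration:])

# Hypothetical policy: alert when CPU stays above 90% for 3 samples.
cpu_policy = MonitoringPolicy(metric="cpu_util", threshold=90.0, duration=3)
print(should_alert([85, 95, 96, 97], cpu_policy))  # True
print(should_alert([85, 95, 96, 80], cpu_policy))  # False
```

A transient spike that recedes before the duration window elapses would not fire, which is one reason poorly chosen thresholds generate false alerts.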

In this paper, we describe Polygraph, a system for reducing false alerts and incidents. Polygraph works by mining historical incidents and alerts and correlating them with other historical data, such as system health time-series data, server similarities, the operational context of servers, and other sources. The resulting output is a set of monitoring policies with projected accuracies and projected rates of false alert reduction if deployed in the environment. Polygraph is unique in that it applies an active learning approach to these outputs: it presents policies with low projected scores to system administrators for verification instead of deploying them automatically. Polygraph uses a four-step process. First, it attempts to detect alerts that can safely be removed (i.e., false alerts); second, it generates a set of candidate policies to achieve this by estimating new thresholds; next, it calculates, via simulation, the projected savings in terms of false alerts removed from the environment while ensuring that true incidents are not missed; and finally, upon verification by system administrators, the new policies are dispatched to monitoring servers, which further push them to individual components. We evaluate Polygraph on a real-life trace of around 60K incidents collected over 30 days from a portion of a large IT service delivery infrastructure. We use the older traces for learning, and the more recent traces for testing the effectiveness of the new policies generated by Polygraph. Our results indicate a significant reduction of false alerts while keeping the number of missed true events (i.e., false negatives) to a minimum. We also discuss several ways Polygraph can be extended to increase its false alert detection rate and overall effectiveness.
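The simulation step described above can be illustrated with a minimal sketch: replay historical alerts, each labeled as a true or false incident, against a candidate threshold and report the projected false alert reduction alongside the true incidents that would be missed. The function name, the history format, and the single-threshold candidate policy are simplifying assumptions for illustration; they do not reflect Polygraph's actual internals.

```python
def evaluate_candidate_threshold(history, new_threshold):
    """Replay historical alerts against a candidate threshold and project
    the savings (false alerts suppressed) and the risk (true incidents
    that would now be missed).

    `history` is a list of (metric_value, was_true_incident) pairs, one
    per alert fired under the current policy.
    """
    suppressed_false = missed_true = kept = 0
    for value, was_true in history:
        if value > new_threshold:
            kept += 1                  # alert would still fire
        elif was_true:
            missed_true += 1           # a real incident we would now miss
        else:
            suppressed_false += 1      # a false alert safely removed
    total_false = sum(1 for _, t in history if not t)
    reduction = suppressed_false / total_false if total_false else 0.0
    return {"false_alert_reduction": reduction,
            "missed_true_incidents": missed_true,
            "alerts_kept": kept}

# Hypothetical historical alerts: (observed value, was it a true incident?)
history = [(91, False), (92, False), (93, False), (98, True), (99, True)]
print(evaluate_candidate_threshold(history, new_threshold=95))
# {'false_alert_reduction': 1.0, 'missed_true_incidents': 0, 'alerts_kept': 2}
```

A candidate policy with a high projected reduction and zero missed true incidents could be deployed with confidence; one with a low projected score would, per the active learning approach, be routed to a system administrator for verification first.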

By: Sangkyum Kim, Winnie Cheng, Shang Guo, Laura Luan, Daniela Rosu, Abhijit Bose

Published in: RC25174 in 2011


This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).


Questions about this service can be mailed to reports@us.ibm.com.