Crying Wolf and Meaning It: Reducing False Alarms in Monitoring of Sporadic Operations through POD-Monitor

Xiwei Xu; Liming Zhu; Min Fu; Daniel Sun; An Binh Tran; Paul Rimba; Srini Dwarakanathan; Len Bass

doi:10.1109/COUFLESS.2015.18

Abstract

When monitoring complex applications in cloud systems, operators face a difficult problem: false positive alarms. The problem worsens when the system is sporadically changed and upgraded, as happens under the emerging practice of continuous deployment. Other legitimate but sporadic maintenance operations, such as log compression, garbage collection, and data reconstruction in distributed systems, can also trigger false alarms. Consequently, traditional baseline-based anomaly detection and monitoring are less effective. A common but dangerous practice is to turn off normal monitoring during sporadic operations such as upgrades and maintenance. In this paper, we report on using the process context information of sporadic operations to suppress false positive alarms. We use the context information both directly and to train machine learning models. Our experimental evaluation shows that 1) using process context directly improves alarm precision up to 0.226 (a 36.1% improvement), and 2) using machine learning models trained with process context improves precision up to 0.421 (an 84.7% improvement).
