2015 IEEE/ACM 1st International Workshop on Complex Faults and Failures in Large Software Systems (COUFLESS)

Abstract

When monitoring complex applications in cloud systems, a difficult problem for operators is receiving false positive alarms. The problem becomes worse when the system is sporadically changed and upgraded under the emerging practice of continuous deployment. Other legitimate but sporadic maintenance operations, such as log compression, garbage collection, and data reconstruction in distributed systems, can also trigger false alarms. Consequently, traditional baseline-based anomaly detection and monitoring are less effective. A common but dangerous practice is to turn off normal monitoring during sporadic operations such as upgrades and maintenance. In this paper, we report on the use of the process context information of sporadic operations to suppress false positive alarms. We use the context information both directly and to train machine learning models. Our experimental evaluation shows that 1) using process context directly improves alarm precision to as much as 0.226 (a 36.1% improvement), and 2) using machine learning models trained on process context improves precision to as much as 0.421 (an 84.7% improvement).
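The idea of using process context directly can be illustrated with a minimal sketch. It is not the paper's implementation: the `Alarm` and `OperationStep` records, the `suppress` rule (drop an alarm when a running operation step is expected to disturb that metric), and the sample data are all assumptions made up for illustration. Precision is computed in the usual way, as true alarms divided by all alarms raised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    timestamp: float      # seconds on an arbitrary clock
    metric: str           # metric that crossed its baseline threshold
    is_true_fault: bool   # ground-truth label, used only for evaluation


@dataclass(frozen=True)
class OperationStep:
    # Hypothetical process-context record for one step of a sporadic
    # operation (e.g. an upgrade); the fields are assumptions.
    start: float
    end: float
    expected_metrics: frozenset  # metrics this step is expected to disturb


def suppress(alarms, steps):
    """Keep only alarms that no active operation step explains."""
    kept = []
    for alarm in alarms:
        explained = any(
            step.start <= alarm.timestamp <= step.end
            and alarm.metric in step.expected_metrics
            for step in steps
        )
        if not explained:
            kept.append(alarm)
    return kept


def precision(alarms):
    """Fraction of raised alarms that correspond to real faults."""
    if not alarms:
        return 0.0
    return sum(a.is_true_fault for a in alarms) / len(alarms)


# Made-up sample data: one upgrade step expected to disturb CPU usage.
STEPS = [OperationStep(100.0, 200.0, frozenset({"cpu_utilization"}))]
ALARMS = [
    Alarm(50.0, "cpu_utilization", True),
    Alarm(120.0, "cpu_utilization", False),  # caused by the upgrade step
    Alarm(150.0, "cpu_utilization", False),  # caused by the upgrade step
    Alarm(160.0, "disk_errors", True),       # not explained by the step
    Alarm(300.0, "cpu_utilization", False),  # outside any operation window
]

KEPT = suppress(ALARMS, STEPS)
print(precision(ALARMS), precision(KEPT))
```

In this sketch, monitoring stays on during the operation: an alarm on a metric the step is not expected to touch still reaches the operator, which is the contrast with turning monitoring off entirely.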
