Distributed Predictive Performance Anomaly Detection for Virtualized Clouds
AUTHORS: Ali Jehangiri, Ramin Yahyapour, Philipp Wieder, Edwin Yaqub
An increasing number of applications are hosted on cloud-based platforms. Cloud platforms serve as a general computing facility, and the applications hosted on them range from simple multi-tier web applications to complex social networking, eCommerce, and Big Data applications. High availability, performance, and auto-scaling are key requirements of cloud-based applications. Cloud platforms meet these requirements by provisioning resources dynamically in an on-demand, multi-tenant fashion. A key challenge for cloud service providers is to ensure Quality of Service (QoS), as customers require more explicit QoS guarantees for the services provisioned. Cloud service performance problems can lead directly to extensive financial losses. Control and verification of QoS thus become a vital concern for any production-level deployment, and it is crucial to address performance as a managed objective. The success of cloud services depends critically on automated problem diagnostics and predictive analytics that enable organizations to manage their performance proactively. Moreover, effective and advanced monitoring is equally important for performance management support in clouds.
In this thesis, we explore the key techniques for developing monitoring and performance management systems to achieve robust cloud systems. First, two case studies are presented to motivate the need for a scalable monitoring and analytics framework. The first case study examines performance issues of a software service hosted on a virtualized platform. The second analyzes cloud services offered by a large IT service provider. A generalization of the case studies forms the basis for the requirement specifications, which are used for the state-of-the-art analysis. Although solutions for particular challenges have already been provided, a scalable approach for performance problem diagnosis and prediction is still missing.
To address this issue, a distributed, scalable monitoring and analytics framework is presented in the first part of this thesis. We conducted a thorough analysis of the technologies to be used by our framework. The framework makes use of existing monitoring and analytics technologies. However, we develop custom collectors to retrieve data non-intrusively from different layers of the cloud. In addition, we develop the analytics subscriber and publisher components, which retrieve service-related events from different APIs and send alerts to the SLA Management component so that corrective measures can be taken. Further, we implemented an Open Cloud Computing Interface (OCCI) monitoring extension using the OCCI Mixin mechanism.
To deal with performance problem diagnosis, a novel distributed parallel approach for performance anomaly detection is presented. First, all anomalous metrics are found from a distributed database of time series for a particular window. For comparative analysis, three lightweight statistical anomaly detection techniques are selected. We extend these techniques to work with the MapReduce paradigm and assess and compare the methods in terms of precision, recall, execution time, speedup, and scale-up. Next, we correlate the anomalous metrics with the target Service Level Objective (SLO) in order to locate the suspicious metrics. We implemented and evaluated our approach on a production cloud encompassing the Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) service models. Experimental results confirm that our approach is efficient and effective in capturing the metrics causing performance anomalies.
Finally, we present the design and implementation of an online anomaly prediction system for cloud computing infrastructures. We further present an experimental evaluation of a set of anomaly prediction methods that aim at predicting upcoming periods of high utilization or poor performance with enough lead time to enable the appropriate scheduling, scaling, and migration of virtual resources.
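As an illustration of the two-step diagnosis approach described above (flagging anomalous metrics within a window, then correlating them with the target SLO), the following is a minimal single-process sketch written in a map/shuffle/reduce style. It is not the thesis implementation. The abstract does not name the three statistical techniques, so a simple k-sigma rule against a baseline is used here as a stand-in. The record layout `(metric, timestamp, value)`, the function names, the thresholds `k` and `min_corr`, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def map_phase(records):
    """Map step: emit (metric_name, (timestamp, value)) pairs."""
    for metric, ts, value in records:
        yield metric, (ts, value)


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def reduce_phase(points, window_start, k=3.0):
    """Reduce step: True if any value in the window deviates from the
    baseline mean by more than k baseline standard deviations
    (assumed stand-in rule, not one of the thesis techniques)."""
    baseline = [v for ts, v in points if ts < window_start]
    window = [v for ts, v in points if ts >= window_start]
    if len(baseline) < 2 or not window:
        return False
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return any(v != mu for v in window)
    return any(abs(v - mu) / sigma > k for v in window)


def pearson(xs, ys):
    """Pearson correlation of two equally long, time-aligned series."""
    if len(xs) != len(ys):
        raise ValueError("series must be aligned and of equal length")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def suspicious_metrics(records, slo_name, window_start, k=3.0, min_corr=0.8):
    """Return {metric: correlation} for anomalous metrics that are
    strongly correlated with the SLO metric."""
    groups = shuffle(map_phase(records))
    if slo_name not in groups:
        raise KeyError(slo_name)
    slo = [v for _, v in sorted(groups[slo_name])]
    result = {}
    for metric, points in groups.items():
        if metric == slo_name or not reduce_phase(points, window_start, k):
            continue
        r = pearson([v for _, v in sorted(points)], slo)
        if abs(r) >= min_corr:
            result[metric] = r
    return result


def example_records():
    """Synthetic data: CPU and response time jump at t >= 10; disk does not."""
    records = []
    for t in range(12):
        spike = t >= 10
        records.append(("cpu", t, 90 + 5 * (t - 10) if spike else 10 + t % 2))
        records.append(("response_time", t, 900 + 50 * (t - 10) if spike else 100))
        records.append(("disk", t, 5 + t % 3))
    return records
```

Calling `suspicious_metrics(example_records(), "response_time", window_start=10)` on this synthetic data reports only `cpu`. In a real MapReduce deployment the reduce step would run in parallel per metric key; here the grouping is done in memory purely to show the data flow.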
Using real data sets gathered from the cloud platforms of a university data center, we compare several approaches ranging from time-series methods (e.g., autoregression (AR)) to statistical classification methods (e.g., a Bayesian classifier). We observe that linear time-series models, especially AR models, are most likely suitable for modeling QoS measures and forecasting their future values. Moreover, linear time-series models can be integrated with Machine Learning (ML) methods to improve proactive QoS management.
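To make the idea of forecasting a QoS measure with an AR model concrete, the sketch below fits a first-order model, x[t] = c + phi * x[t-1], by ordinary least squares and reports how many steps ahead the forecast first exceeds a threshold. The model order, the threshold-based notion of a violation, and all function names are assumptions for illustration; the abstract does not specify the model orders or the alerting rule used in the thesis.

```python
from statistics import mean


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    prev, curr = series[:-1], series[1:]
    mp, mc = mean(prev), mean(curr)
    var = sum((p - mp) ** 2 for p in prev)
    if var == 0:
        return mc, 0.0  # constant history: forecast the mean
    phi = sum((p - mp) * (c - mc) for p, c in zip(prev, curr)) / var
    return mc - phi * mp, phi


def forecast(series, steps, c, phi):
    """Iterate the fitted recurrence forward from the last observation."""
    out, x = [], series[-1]
    for _ in range(steps):
        x = c + phi * x
        out.append(x)
    return out


def predict_violation(series, threshold, lead_steps):
    """Return the first step (1-based) at which the forecast exceeds
    the threshold within lead_steps, or None if it never does."""
    c, phi = fit_ar1(series)
    for step, value in enumerate(forecast(series, lead_steps, c, phi), 1):
        if value > threshold:
            return step
    return None


def example_series(n=10, c=2.0, phi=0.9):
    """Synthetic utilization-like series generated by an exact AR(1) rule."""
    xs = [0.0]
    for _ in range(n):
        xs.append(c + phi * xs[-1])
    return xs
```

The returned step count is the lead time available for scheduling, scaling, or migration before the forecast crosses the threshold. A higher-order AR(p) model or a classifier layered on top of the forecasts would follow the same pattern but is not shown here.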