To replace heterogeneous legacy monitoring systems, a central, state-of-the-art monitoring platform is being set up. It ensures that the customer's entire IT infrastructure is monitored and makes monitoring services available to all internal products. A central requirement for the platform is high availability, so all components of the monitoring stack (Grafana, InfluxDB, Telegraf, Graylog and Icinga) and the associated backend tools and databases were provisioned redundantly by the internal monitoring team on a dedicated VMware cluster. The project followed agile methods, in particular Scrum; PTA managed the project and provided support and advice throughout.
Supplement
High availability is designed and implemented as follows. An HAProxy load balancer with a Keepalived daemon is placed in front of Grafana and Graylog. The central metrics database, InfluxDB, runs as a cluster in the Enterprise edition. The metadata database for the Grafana dashboards is set up as a MariaDB Galera Cluster; MariaDB also serves as the metadata database for the alerting platform Icinga. Alerting for isolated systems in external network segments is handled by an Icinga satellite. For metric collection, Telegraf agents are provisioned on each server (VM) using Puppet, which also distributes the agent configurations automatically.
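As an illustration of the automated agent configuration, the following minimal Python sketch renders a per-host Telegraf configuration of the kind a configuration management tool could distribute. It is not the project's actual Puppet code: the function name `render_telegraf_config`, the host names, the InfluxDB URL, the database name and the chosen input plugins are all assumptions made for this example.

```python
from string import Template

# Hypothetical template. The real agent configuration is not given in the
# text; the section and option names follow the standard Telegraf layout.
TELEGRAF_TEMPLATE = Template("""\
[agent]
  interval = "$interval"
  hostname = "$hostname"

[[outputs.influxdb]]
  urls = [$urls]
  database = "$database"

[[inputs.cpu]]
[[inputs.mem]]
""")


def render_telegraf_config(hostname, influx_urls, database="telegraf", interval="10s"):
    """Return a Telegraf configuration for one server (VM) as a string."""
    # Quote every URL so the result is a valid TOML array of strings.
    urls = ", ".join(f'"{url}"' for url in influx_urls)
    return TELEGRAF_TEMPLATE.substitute(
        interval=interval,
        hostname=hostname,
        urls=urls,
        database=database,
    )


if __name__ == "__main__":
    # Assumed example values, not taken from the project.
    print(render_telegraf_config(
        "vm-01.example.internal",
        ["http://influxdb.example.internal:8086"],
    ))
```

In a Puppet-based setup the same idea would typically be expressed as a template with per-node facts, so that each VM receives its own host name while the output target stays identical across the fleet.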
Subject description
Monitoring is divided into two main areas: functional and technical system monitoring of servers, services, software and middleware, and log and analysis monitoring. In addition to replacing the legacy alerting systems, system monitoring is intended to provide multi-tenant monitoring of internal products in the form of monitoring services. Internal customers can independently build a dashboard for their product on the platform and integrate the corresponding metrics. Log monitoring is provided as a new central platform for log analysis, which enables administrators and developers to analyze logs efficiently and to monitor them professionally.