Monitoring tools

Continuous monitoring of high-performance computing and storage resources and of core services is crucial for achieving a high level of service reliability. A number of system parameters have to be under constant observation and control: from hardware conditions (motherboard and CPU temperature, fan speed, disk state), through the state of the operating system, to Grid-related events and daemons. The Scientific Computing Laboratory (SCL) uses several monitoring tools, a number of which were partially or fully developed in-house and are publicly available. The monitoring tools used at SCL are listed below.

Ganglia:

Ganglia is a scalable distributed monitoring system for high-performance computing systems, such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency.
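
To illustrate the XML data representation mentioned above, here is a minimal Python sketch that reads a Ganglia XML report and groups metric values by host. It assumes the usual layout in which a gmond daemon writes its XML state to any client connecting over TCP (port 8649 is the common default, but check the local configuration). The sample document, host names, and the helper names fetch_gmond_xml and metrics_by_host are illustrative assumptions, not part of Ganglia itself.

```python
import socket
import xml.etree.ElementTree as ET
from typing import Dict, Union

# Hand-written sample in the shape of a gmond report (hypothetical hosts and values).
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="example-cluster" LOCALTIME="0" OWNER="unspecified" URL="unspecified">
    <HOST NAME="node01.example.org" IP="10.0.0.1" REPORTED="0">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="8" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
    <HOST NAME="node02.example.org" IP="10.0.0.2" REPORTED="0">
      <METRIC NAME="load_one" VAL="1.73" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def fetch_gmond_xml(host: str, port: int = 8649, timeout: float = 5.0) -> bytes:
    """Read the full XML report from a gmond daemon (assumed default port 8649)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after sending the report
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_data: Union[str, bytes]) -> Dict[str, Dict[str, str]]:
    """Return {host name: {metric name: raw value string}} from a Ganglia report."""
    root = ET.fromstring(xml_data)
    result: Dict[str, Dict[str, str]] = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return result


if __name__ == "__main__":
    # Parse the embedded sample; replace with fetch_gmond_xml("some-host") on a real cluster.
    for name, metrics in metrics_by_host(SAMPLE_XML).items():
        print(name, metrics)
```

Values are kept as strings because each metric carries its own TYPE attribute; a caller can convert them as needed.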

Pakiti:

Pakiti provides a monitoring and notification mechanism for checking the patching status of installed packages on RPM-based Linux systems. It uses a client/server model in which clients and servers exchange information over HTTP(S).

SAM:

SAM (Service Availability Monitoring) is a framework used in EGEE for the monitoring of production Grid sites. It provides a set of probes that are submitted at regular intervals, and a database that stores the test results. In effect, SAM monitors Grid services from a user's perspective.

BBmSAM:

BBmSAM is a web application implemented in PHP that relies on a MySQL database for data storage. The tool has been tested under different web servers (Apache, Microsoft IIS) and can be used with any web server that supports PHP (at least through CGI).

GStat:

GStat is an application designed to monitor EGEE/LCG-compatible Information Systems. Its purpose is to detect faults, verify the validity of the published data, and display useful data from the Information System.

WatG Browser:

The WatG Browser (What is at the Grid Browser) is a web-based Grid Information System (GIS) visualization application that provides a detailed overview of the status and availability of various Grid resources in a given gLite-based e-Infrastructure. It can query and present data obtained from Grid information systems at different layers: from the local resource information system of a particular Grid service (GRIS), through the Grid site information system (site BDII), to the top-level information system for the whole Grid infrastructure (top-level BDII).
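
The three information system layers can be pictured as three LDAP endpoints that differ in host and search base. The Python sketch below is not WatG code; it only builds the arguments of a standard OpenLDAP ldapsearch call for each layer. The host names, port 2170, the base DNs, and the GlueService filter are assumptions based on common gLite conventions and should be verified against the actual deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InfoSystemEndpoint:
    """One layer of the Grid information system (all values are illustrative)."""
    layer: str
    host: str
    port: int
    base_dn: str

    def uri(self) -> str:
        return f"ldap://{self.host}:{self.port}"


# Hypothetical endpoints; ports and base DNs are assumed, not taken from the source.
ENDPOINTS = [
    InfoSystemEndpoint("GRIS", "ce.example.org", 2170, "mds-vo-name=resource,o=grid"),
    InfoSystemEndpoint("site BDII", "bdii.example.org", 2170, "mds-vo-name=EXAMPLE-SITE,o=grid"),
    InfoSystemEndpoint("top-level BDII", "top-bdii.example.org", 2170, "mds-vo-name=local,o=grid"),
]


def ldapsearch_command(endpoint: InfoSystemEndpoint,
                       ldap_filter: str = "(objectClass=GlueService)") -> List[str]:
    """Build an anonymous ldapsearch invocation for the given layer."""
    return [
        "ldapsearch",
        "-x",                    # simple (anonymous) bind
        "-LLL",                  # plain LDIF output without comments
        "-H", endpoint.uri(),    # LDAP URI of the information system
        "-b", endpoint.base_dn,  # search base for this layer
        ldap_filter,
    ]


if __name__ == "__main__":
    for ep in ENDPOINTS:
        print(f"{ep.layer}: {' '.join(ldapsearch_command(ep))}")
```

Keeping the layer description separate from the command construction means the same query can be pointed at any of the three layers by swapping the endpoint.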

CGMT:

Cumulative Grid Monitoring Script (CGMT) is a set of scripts, accompanied by simple web interfaces, developed for Grid site monitoring and for the integrated presentation of results provided by various monitoring tools. Some of these scripts can be deployed on any general-purpose computing cluster, without the involvement of gLite middleware.

WMSMON:

WMSMON is a tool that provides site-independent, centralized, uniform monitoring of gLite WMS/LB services.