Presentation

CANARI: A Monitoring Framework for Cluster Analysis and Node Assessment for Resource Integrity
Description
Research computing facilitators must balance providing up-to-date versions of software against keeping the software ecosystem stable enough that version changes do not degrade the performance of existing workflows. Additionally, the data centers where these ecosystems run are complex systems with many points of failure. These challenges create a need for tools that verify these systems continue to perform at their expected levels. We present such a tool: CANARI, a framework for Cluster Analysis and Node Assessment for Resource Integrity. CANARI was developed and is used at the Rosen Center for Advanced Computing (RCAC) to monitor the availability of nodes in our clusters and their performance against synthetic benchmarks, ingest that performance data into a persistent database, mark nodes that show performance regression offline, and send summary reports and real-time alerts to RCAC's Slack instance through Slack's API.
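The workflow described above (store benchmark results, detect regression, take the node offline, alert Slack) might look roughly like the following minimal sketch. It is not CANARI's implementation: the table layout, the 10% threshold, the median baseline, the "higher score is better" convention, and all function names are hypothetical. It also assumes a Slurm scheduler for the drain command and a Slack incoming-webhook style payload, neither of which the abstract specifies. The sketch only builds the command and the message; it does not run or send them.

```python
import sqlite3
import statistics

REGRESSION_THRESHOLD = 0.10  # hypothetical: flag nodes more than 10% below baseline


def open_db(path=":memory:"):
    """Open the results database (in-memory here; a file path would persist it)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "node TEXT, benchmark TEXT, score REAL)"
    )
    return conn


def record_result(conn, node, benchmark, score):
    """Ingest one benchmark score (assumed: higher is better)."""
    conn.execute(
        "INSERT INTO results (node, benchmark, score) VALUES (?, ?, ?)",
        (node, benchmark, score),
    )


def find_regressions(conn, benchmark, threshold=REGRESSION_THRESHOLD):
    """Return (node, latest, baseline) for nodes whose latest score is
    below (1 - threshold) times the median of their earlier scores."""
    rows = conn.execute(
        "SELECT node, score FROM results WHERE benchmark = ? ORDER BY id",
        (benchmark,),
    ).fetchall()
    history = {}
    for node, score in rows:
        history.setdefault(node, []).append(score)
    regressed = []
    for node, scores in history.items():
        if len(scores) < 2:
            continue  # no baseline to compare against yet
        baseline = statistics.median(scores[:-1])
        if scores[-1] < (1 - threshold) * baseline:
            regressed.append((node, scores[-1], baseline))
    return regressed


def drain_command(node, reason):
    """Build (not run) a Slurm command that drains a node; Slurm is an assumption."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def slack_payload(node, benchmark, latest, baseline):
    """Build a JSON-serializable message body with a plain "text" field."""
    return {
        "text": f"{node}: {benchmark} scored {latest:.1f} "
                f"against a baseline of {baseline:.1f}; node marked offline"
    }
```

In a real deployment the command list would be passed to something like `subprocess.run` and the payload posted to a webhook URL; both steps are left out here so the sketch has no side effects.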
Event Type
Workshop
Time
Friday, 22 November 2024, 10:30am - 11am EST
Location
B311
Tags
State of the Practice
System Administration
Registration Categories
W