JSM 2013 Home

Abstract Details

Activity Number: 670
Type: Topic Contributed
Date/Time: Thursday, August 8, 2013 : 10:30 AM to 12:20 PM
Sponsor: Section on Physical and Engineering Sciences
Abstract - #309061
Title: An Approach for Predictive Fault Isolation in High-Performance Computing Systems
Author(s): David Robinson*+ and Jon Stearley
Companies: Sandia National Laboratories and Sandia National Laboratories
Keywords: reliability ; masked data ; Bayesian ; high performance computing
Abstract:

In this paper, we consider the identification of faults in high-performance computing (HPC) systems. While the components allocated to a job in an HPC system are known, only an unknown subset of those components is actually used by the job. Further, the specific components responsible for a job interruption can be masked (unobservable). A hierarchical Bayesian model is proposed to characterize the probability that each HPC component contributes to job interruption. The output of this model can then be used to schedule job resources so as to minimize the likelihood of interruption, to schedule maintenance resources more efficiently, and to schedule jobs so as to reduce the uncertainty in fault identification. Data from a simulated HPC system are used to demonstrate the methodology.
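
The masked-data idea can be illustrated with a small sketch. This is not the authors' model: it is a deliberately simplified, non-hierarchical stand-in that assumes every allocated component is used, that each component fails independently with a fixed per-job probability, and that a job is interrupted when at least one of its components fails. The function names (`simulate_jobs`, `sample_failures`, `gibbs_posterior_means`), the Beta(1, 1) prior, and all numeric settings are hypothetical choices made for illustration. A Gibbs sampler alternates between imputing the unobserved culprits of each interrupted job and updating each component's failure probability.

```python
import random


def simulate_jobs(true_p, n_jobs, job_size, rng):
    """Simulate jobs; only the allocation and the interrupt flag are observed."""
    jobs = []
    for _ in range(n_jobs):
        alloc = rng.sample(range(len(true_p)), job_size)
        interrupted = any(rng.random() < true_p[i] for i in alloc)
        jobs.append((alloc, interrupted))
    return jobs


def sample_failures(probs, rng):
    """Draw failure indicators conditional on at least one failure occurring."""
    k = len(probs)
    # tail[m] = probability that none of components m..k-1 fails
    tail = [1.0] * (k + 1)
    for m in range(k - 1, -1, -1):
        tail[m] = tail[m + 1] * (1.0 - probs[m])
    z = [0] * k
    seen_failure = False
    for m in range(k):
        if seen_failure:
            p = probs[m]  # constraint already satisfied
        elif m == k - 1:
            p = 1.0  # last chance: this component must be the culprit
        else:
            p = probs[m] / (1.0 - tail[m])
        z[m] = 1 if rng.random() < p else 0
        seen_failure = seen_failure or z[m] == 1
    return z


def gibbs_posterior_means(jobs, n_components, a=1.0, b=1.0,
                          n_iter=600, burn_in=200, seed=0):
    """Posterior mean failure probability per component (assumed Beta(a, b) prior)."""
    rng = random.Random(seed)
    exposure = [0] * n_components  # number of jobs each component was allocated to
    for alloc, _ in jobs:
        for i in alloc:
            exposure[i] += 1
    p = [0.1] * n_components
    totals = [0.0] * n_components
    for it in range(n_iter):
        fails = [0] * n_components
        # Step 1: impute the masked culprits of every interrupted job.
        for alloc, interrupted in jobs:
            if interrupted:
                z = sample_failures([p[i] for i in alloc], rng)
                for i, zi in zip(alloc, z):
                    fails[i] += zi
        # Step 2: conjugate Beta update given the imputed failures.
        for i in range(n_components):
            draw = rng.betavariate(a + fails[i], b + exposure[i] - fails[i])
            p[i] = min(max(draw, 1e-9), 1.0 - 1e-9)
        if it >= burn_in:
            for i in range(n_components):
                totals[i] += p[i]
    return [t / (n_iter - burn_in) for t in totals]


# Hypothetical system: eight components, one of which (index 3) is faulty.
true_p = [0.01] * 8
true_p[3] = 0.30
jobs = simulate_jobs(true_p, n_jobs=400, job_size=3, rng=random.Random(1))
means = gibbs_posterior_means(jobs, n_components=8)
ranked = sorted(range(8), key=lambda i: means[i], reverse=True)
print("components ranked by posterior mean failure probability:", ranked)
```

A ranking like `ranked` is the kind of output that could feed scheduling decisions, for example by steering jobs away from the most suspect components or by prioritizing them for maintenance. Extending the sketch toward the setting in the abstract would require, at a minimum, a hyperprior on `a` and `b` and a latent indicator for which allocated components are actually used.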


Authors who are presenting talks have a * after their name.



The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.

Copyright © American Statistical Association.