Abstract Details
Activity Number: 670
Type: Topic Contributed
Date/Time: Thursday, August 8, 2013, 10:30 AM to 12:20 PM
Sponsor: Section on Physical and Engineering Sciences
Abstract - #309061
Title: An Approach for Predictive Fault Isolation in High-Performance Computing Systems
Author(s): David Robinson*+ and Jon Stearley
Companies: Sandia National Labs and Sandia National Laboratories
Keywords: reliability; masked data; Bayesian; high performance computing
Abstract:
In this paper, we consider the identification of faults in high-performance computing (HPC) systems. While the components allocated to a job in an HPC system are known, only an unknown subset of those components is actually used by the job. Further, the specific components responsible for a job interruption can be masked (unobservable). A hierarchical Bayesian model is proposed to characterize the probability that each HPC component contributes to job interruption. The output of this model can then be used to schedule job resources so as to minimize the likelihood of interruption, to schedule maintenance resources more efficiently, and to schedule jobs so as to reduce the uncertainty in fault identification. Data from a simulated HPC system are used to demonstrate the methodology.
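The masked-data idea can be illustrated with a small sketch. This is not the authors' model: the abstract does not give its form, so the code below makes several simplifying assumptions. Every allocated component is treated as used, each component fails independently with its own probability, the priors are independent Beta(1, 9) rather than hierarchical, and inference uses a plain random-walk Metropolis sampler. All names (`simulate_jobs`, `posterior_means`, `true_p`) and all numeric settings are hypothetical.

```python
import math
import random


def simulate_jobs(true_p, n_jobs, alloc_size, rng):
    """Simulate jobs with random allocations; only the interrupt flag is kept."""
    jobs = []
    for _ in range(n_jobs):
        alloc = rng.sample(range(len(true_p)), alloc_size)
        # Masked data: we record THAT the job was interrupted, not which
        # component caused it.
        interrupted = any(rng.random() < true_p[c] for c in alloc)
        jobs.append((alloc, interrupted))
    return jobs


def log_lik(p, jobs):
    """Log-likelihood: a job survives only if every allocated component does."""
    total = 0.0
    for alloc, interrupted in jobs:
        survive = 1.0
        for c in alloc:
            survive *= 1.0 - p[c]
        total += math.log(1.0 - survive) if interrupted else math.log(survive)
    return total


def log_prior(x, a=1.0, b=9.0):
    """Unnormalized Beta(a, b) log-density (assumed prior, mean 0.1)."""
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def posterior_means(jobs, n_components, n_iter=600, burn_in=100,
                    step=0.05, seed=1):
    """Component-wise random-walk Metropolis; returns posterior mean of each p."""
    rng = random.Random(seed)
    p = [0.1] * n_components
    # Only jobs that include component c affect the acceptance ratio for p[c].
    by_comp = [[j for j in jobs if c in j[0]] for c in range(n_components)]
    sums = [0.0] * n_components
    for it in range(n_iter):
        for c in range(n_components):
            old = p[c]
            new = old + rng.gauss(0.0, step)
            if not (1e-6 < new < 1.0 - 1e-6):
                continue  # proposal outside the support: reject
            before = log_lik(p, by_comp[c]) + log_prior(old)
            p[c] = new
            after = log_lik(p, by_comp[c]) + log_prior(new)
            if rng.random() >= math.exp(min(0.0, after - before)):
                p[c] = old  # reject and restore the previous value
        if it >= burn_in:
            for c in range(n_components):
                sums[c] += p[c]
    return [s / (n_iter - burn_in) for s in sums]


# Hypothetical system: component 0 is faulty, the rest are healthy.
true_p = [0.30, 0.01, 0.01, 0.01, 0.01, 0.01]
jobs = simulate_jobs(true_p, n_jobs=300, alloc_size=3, rng=random.Random(0))
means = posterior_means(jobs, len(true_p))
suspect = max(range(len(means)), key=means.__getitem__)
print("posterior means:", [round(m, 3) for m in means])
print("most suspect component:", suspect)
```

In a sketch like this, the posterior means could feed a scheduler that avoids the most suspect components, which is the kind of use the abstract describes. Modeling the unknown used subset and sharing information across components would require the hierarchical structure that the abstract mentions but does not specify.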
Authors who are presenting talks have a * after their name.
2013 JSM Online Program Home
The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.
Copyright © American Statistical Association.