Abstract:
|
Los Alamos National Laboratory is home to several supercomputers. In this paper, we model the hardware reliability of Blue Mountain, the first of a new generation of laboratory supercomputers. Blue Mountain comprises 48 shared memory processors (SMPs), which act like 48 repairable systems in series. When a hardware failure is detected, the failed part is restored to working condition and the SMP is made available for use again. Ryan and Reese (2001) modeled hardware failure data from Blue Mountain using a non-homogeneous Poisson process whose intensity function permits reliability growth with a constant limiting failure rate. A hierarchical specification for the parameters governing the intensity function completed the model. Here, we extend the Ryan-Reese model by incorporating an exposure variable in the Level I Poisson process. Including the exposure variable allows the model to reflect the varying levels of usage that the different SMPs received. The remaining hierarchical specification permits borrowing of strength across SMPs and the incorporation of expert knowledge in the final level of the model.
|