Activity Number: 245
Type: Invited
Date/Time: Tuesday, August 4, 2009 : 8:30 AM to 10:20 AM
Sponsor: Section on Physical and Engineering Sciences

Abstract - #302931

Title: Reliability in Supercomputing: A Million Processors Cooperating to Solve One Problem
Author(s): George Ostrouchov*+ and Thomas J. Naughton, III and Stephen L. Scott
Companies: Oak Ridge National Laboratory and Oak Ridge National Laboratory and Oak Ridge National Laboratory
Address: P.O. Box 2008, Oak Ridge, TN, 37831
Keywords: high performance computing ; parallel computing ; hardware
Abstract:
The world's largest supercomputers currently have hundreds of thousands of processing cores, and that number will soon surpass a million. Counting other components, such as disks, I/O support, memory, and buses, we are already at a million before considering the software components. Combining very large numbers of individually highly reliable components can produce a surprisingly unreliable system if reliability is not addressed explicitly. This talk will describe some of the current supercomputers, their emerging reliability issues, and how those issues are being addressed, including some of our own work. This area still has many more questions than answers, and its solutions will include components based on statistical methods and ideas.
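
The point about combining many reliable components can be illustrated with a short calculation. The sketch below is not taken from the talk. It assumes a simple series model in which every component must work, failures are independent, and lifetimes are exponential. The function names and the numbers (a million components, a per-component reliability of 0.999999, a component MTBF of one million hours) are hypothetical and chosen only for illustration.

```python
import math


def series_reliability(p: float, n: int) -> float:
    """Probability that all n components work.

    Assumes (for illustration only) independent components,
    each working with probability p, all required by the system.
    """
    return p ** n


def series_mtbf_hours(component_mtbf_hours: float, n: int) -> float:
    """Mean time between failures of a series system of n components.

    Assumes independent, exponentially distributed component lifetimes,
    so the failure rates add and the system MTBF is the component MTBF / n.
    """
    return component_mtbf_hours / n


if __name__ == "__main__":
    n = 1_000_000          # hypothetical component count
    p = 0.999999           # hypothetical per-component reliability
    # With p = 1 - 1/n, the system reliability is close to exp(-1).
    print(f"system reliability: {series_reliability(p, n):.4f}")
    print(f"exp(-1) for comparison: {math.exp(-1):.4f}")
    # A component MTBF of a million hours leaves a system MTBF of one hour.
    print(f"system MTBF (hours): {series_mtbf_hours(1_000_000.0, n):.1f}")
```

Under these assumptions, a component that fails only once in a million uses still leaves the whole system working a little over a third of the time. Real machines have correlated failures and redundancy, so this model is only a starting point.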