[CSEE Talk] talk: Emerging Challenges in High Performance Computing, 1pm Fri 11/9, ITE227, UMBC

Tim Finin finin at cs.umbc.edu
Tue Oct 23 23:54:34 EDT 2012


			   CSEE Colloquium

	  Emerging Challenges in High Performance Computing:
	   Resilience and the Science of Embracing Failure

			     John T. Daly
	      Advanced Computing Systems Program at the
        Department of Defense / Center for Exceptional Computing

	    1:00pm Friday, 9 November 2012, ITE 227, UMBC

Resilience is about keeping the application workload running to a
correct solution in a timely and efficient manner in spite of system
failures. Future extreme scale supercomputers are likely to suffer
more frequent failures than current systems: as devices scale, they
become more susceptible to upsets due to radiation and to errors due to
manufacturing variances. The probability of multiple bit upsets is
growing, since an event is increasingly likely to impact multiple
nearby cells. The use of near-threshold voltage in order to reduce
power consumption also increases error rates. Thus, we can expect more
frequent hardware failures, and a significant rate of undetected soft
errors. While failure-free system hardware and software are desirable,
that goal may not be achievable at reasonable cost, because hardened
components and the methodologies for designing and testing critical
software both tend to be extremely expensive. The challenge is to construct
a system out of less than perfectly reliable hardware and software
that nevertheless behaves as a reliable system from the perspective of
the user.

John T. Daly is a computer systems researcher for the Advanced
Computing Systems (ACS) Program at the Department of Defense / Center
for Exceptional Computing (CEC). He is focused on the problem of
keeping supercomputer applications running toward a correct solution
in a timely and efficient manner in the presence of system
degradations and failures. His research interests include mathematical
modeling and analysis of failure, reliability, fault tolerance,
calculational correctness, and throughput for applications at extreme
scale. Before coming to the CEC, John was a researcher and resilience
technical leader in the High Performance Computing (HPC) division at
Los Alamos National Laboratory and a software engineer and application
analyst for Raytheon Intelligence and Information Systems. He is a
nationally recognized expert in resilience with 25 years of experience
developing, porting, and running applications as an early adopter of
many of the world's fastest supercomputers. He holds degrees in
engineering and applied science and aerospace engineering from Caltech
and Princeton University.

     -- more information and directions: http://bit.ly/UMBCtalks --


