Reliability Research - Fortune or Fallacy

Overall theme (Antonio Gonzalez):

The purpose of this panel is to debate the relevance of reliability research for computer architects. In the past, reliability was addressed through fabrication techniques (e.g., burn-in and testing) and circuit techniques, while microarchitectural techniques were focused primarily on mission-critical systems. Over the past 5-10 years, however, reliability has moved into the mainstream of computer architecture research. On the one hand, transient and permanent faults are a looming problem due to CMOS scaling that must be solved. In a recent keynote, Shekhar Borkar summed up the emerging design space as follows: "Future designs will consist of 100B transistors, 20B of which are unusable due to manufacturing defects, 10B will fail over time due to wearout, and regular intermittent errors will be observed." This vision clearly suggests that fault tolerance must become a first-class design feature.

On the other hand, some people believe that reliability provides little added value for the bulk of computer systems sold today. They claim that researchers have artificially inflated the magnitude of the problem to increase the perceived value of their work. In reality, unreliable operation has been accepted by consumers as commonplace, and significant increases in hardware failure rates will have little effect on the end-user experience. Reliability is simply a tax that the doom-sayers want to levy on your computer system.

This panel will confront these two points of view through two world-class researchers in computer architecture: Scott Mahlke and Shubu Mukherjee.

Fortune viewpoint (Shubu Mukherjee):
Captain Jean-Luc Picard of the starship USS Enterprise once said that there are three versions of the truth: your truth, his truth, and the truth. An end user's truth is that occasional failures are just a nuisance, not a major showstopper. That is, of course, unless one happens at an inconvenient moment, such as when Windows 98 crashed during a Bill Gates demo. The truth, however, is very different from the perspective of an IT manager who has to deal with thousands of end users: the greater the number of end-user complaints per day, the greater her company's total cost of ownership for those machines. And the God-given truth is that silicon reliability is getting worse with every generation, revealing the dark side of Moore's Law.

I will argue that fault tolerance has now become a mainstream architectural consideration in most silicon chips the industry produces. The goal of hardware vendors is to keep the hardware error rate low enough that hardware errors remain masked by software crashes and bugs. This is becoming increasingly challenging because software is, on average, getting more reliable, while silicon reliability is rapidly getting worse. Radiation-induced soft errors, process-related instability, wearout, and variability are introducing challenges and risks to chip design that the industry has never confronted before. It will take an increasing amount of discipline, education, and research to get over these hurdles. To make matters worse, architecture research in reliability is fraught with misconceptions and fallacies.

Fallacy viewpoint (Scott Mahlke):
How often does your laptop crash? Bill Gates has stated that 5 percent of Windows machines crash, on average, twice daily. Put another way, any given machine will crash about three times a month. How often do you lose a call on your cell phone? How often is a word garbled so that you have to ask the person on the other end to repeat it? Would you care if a pixel in one frame of your video was the wrong color? The majority of consumers care little about the reliable operation of electronic devices, and that concern is dropping as these devices become more disposable. In 2006, the average lifetime of a business cell phone was 9 months. Building devices whose hardware functions flawlessly for 20 years simply does not make economic sense. Further, the processor is one of the least likely sources of faults: third-party software, operating systems, disks, memory, and LCD screens all have much lower reliability. Why, then, are computer architects spending so much effort to build highly reliable computer systems? Even if silicon faults scale up by an order of magnitude in the coming years, the end user is unlikely to see any difference, because the reliability of the overall system is dominated by other factors. Further, most devices will be replaced before wearout defects can manifest. In this panel, I will present the top 5 reasons computer architecture research in reliability is a fallacy.
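The crash-rate conversion above can be sanity-checked with a trivial back-of-envelope calculation; the sketch below simply restates the quoted figures (5 percent of machines crashing twice daily, assuming a 30-day month) and is not part of either panelist's argument:

```python
# Back-of-envelope check of the quoted crash statistics (assumed figures):
# if 5% of machines each crash twice per day, the fleet-wide average is
# 0.05 * 2 = 0.1 crashes per machine per day.
fraction_crashing = 0.05   # 5% of Windows machines (per the Gates quote)
crashes_per_day = 2        # daily crashes for those machines
days_per_month = 30        # assumed month length

per_machine_per_month = fraction_crashing * crashes_per_day * days_per_month
print(round(per_machine_per_month, 2))  # about 3 crashes per machine per month
```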