Email or username:

Password:

Forgot your password?
Top-level
Gabriele Svelto

Diagnosing it is hard. Windows has a memory diagnostic tool which will catch the worst offenders and is easy to use: microsoft.com/en-us/surface/do

It's not enough though, some errors can only be caught with more extensive testing. I recommend the open-source memtest86+ (memtest.org/) tool or the closed source memtest86 one (memtest86.com/) 8/17

12 comments
Gabriele Svelto

Naturally what happens on PCs also happens on phones, network devices, printers, TVs, etc... but you can't test them. This is a disaster because these failures are common, and they become more and more common as the device ages. If we want to have repairable devices that last for a long time, the industry will have to change its practices, but more about this later. 9/17

Gabriele Svelto

Now you might wonder: how often does this actually happen? The common wisdom on this topic is that hardware failures are so rare that software bugs will always dwarf them. As I found out this is demonstrably false.

While investigating Firefox crashes I've come to the conclusion that several of the most common issues we were dealing with were likely caused by flaky hardware. This led me to come up with a simple heuristic to detect crashes potentially caused by bit-flips. 10/17

Gabriele Svelto

Deploying this heuristic to Mozilla's crash reporting infrastructure has been eye opening: if I take the 10 most common crashes on Windows, 7 are out-of-memory conditions - that is, not bugs - and 3 are likely caused by bad memory.

You've read that right, three out of the ten most common reasons why Firefox crashes on Windows are caused by memory that's gone bad. 11/17

Gabriele Svelto replied to Gabriele

Now there's a few things that are worth mentioning: users with bad hardware will be over-represented in this category, their machines will crash far more often than others.

The second thing is that Firefox is exceptionally stable, we've driven down its crash rate by more than >70% in the last few years. But Firefox is also a 30 million-lines-of-code monster. There are bugs in there, but they're less common than hardware failures! 12/17

Gabriele Svelto replied to Gabriele

Plotting these types of crashes against time yields interesting trends: the more machines age the more likely they are to encounter hardware-related failures. You might think that's obvious, and indeed it is, but until now the industry has looked the other way, based on the hand-wavy excuse that hardware failures were less common than bugs. 13/17

Gabriele Svelto replied to Gabriele

So what needs to change? First of all, error detection and correction must become commonplace. You can already build a desktop machine with #ECC memory, but it's uncommon in laptops, even mobile workstations, and completely absent on phones and other consumer appliances. This will measurably lengthen the usable life of these devices. 14/17

Gabriele Svelto replied to Gabriele

Note that detection is more important than correction. The user needs to know that there's something wrong without having to run a memory testing program. Think of the lights that turn on in cars if something's malfunctioning, or the error beeps that your washing machine makes when it thinks it's leaking water. These are extremely common, they need to be on computing devices too. 15/17

Gabriele Svelto replied to Gabriele

Finally hardware design must change to make devices repairable and prolong their useful life. Yes, I'm looking at non-ECC memories soldered on the motherboard or worse, on the same substrate as the CPU. 16/17

Gabriele Svelto replied to Gabriele

To end the thread I'd like to thank my colleagues Alex Franchuk and @willcage who did the implementation work and my boss Gian-Carlo Pascutto who plotted crashes against machine age. I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17

Gorgeous na Shock! replied to Gabriele

@gabrielesvelto My dream for a while had been that ECC memory becomes as commonplace as encryption suddenly did circa ~2013 and not just some weird thing only I and a few of my nerd friends do because we're overcautious and weird. 😌

William D. Jones replied to Gabriele

@gabrielesvelto Really depressing that we've reached the physical limits of creating "memory we're confident that actually will store it's value reliably" :(.

We've went from PARITY CHECK 1/2 to "memory works fine without detection or correction" to "oh now not even parity check is enough". In that sense, it's WORSE than 40 years ago :P.

Go Up