[Tfug] ECC (was Re: Using a Laptop as a server)
keith smith
klsmith2020 at yahoo.com
Fri Mar 15 19:35:06 MST 2013
All this talk about ECC has got me thinking. I don't think I have ever used a system that has ECC. Maybe the computers at the UofA computer center in 1983. Maybe the Novell servers circa 1995.
Today? Not sure. I do light management on three web servers. I would guess they do not have ECC memory. I'm going to check.
There is that error once in a while that I cannot explain, however they are few and far between.
------------------------
Keith Smith
--- On Thu, 3/14/13, Louis Taber <ltaber at gmail.com> wrote:
From: Louis Taber <ltaber at gmail.com>
Subject: Re: [Tfug] ECC (was Re: Using a Laptop as a server)
To: "Tucson Free Unix Group" <tfug at tfug.org>
Date: Thursday, March 14, 2013, 7:44 AM
Hi All,
Using the data from the 2004 paper "Soft Errors in Electronic Memory – A White Paper" at http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf, and my system with 16 GBytes of RAM up for 176 days, I computed an estimate of 187 soft errors: 2^(10+4) * 9 * (10^-9) * (176*24) * 300.
This rests on two seriously invalid assumptions: 1) I am using all of the RAM, and 2) it is being used at full speed. The other assumption was a FIT rate of 300 (FIT, Failures In Time: errors per billion (10^9) hours of use, usually reported as FIT per Mbit).
The text suggested FIT rates of a few hundred to a few thousand.
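Spelled out, the arithmetic looks like this (a sketch; my reading of the constants — 2^(10+4) as 16 GBytes expressed in MBytes, and the 9 as bits per byte counting a check bit — is an assumption, since the formula above doesn't label its factors):

```python
# Reproduce the back-of-the-envelope soft-error estimate above.
mbytes = 2 ** (10 + 4)      # 16 GBytes = 16 * 1024 MBytes (assumed reading)
mbits = mbytes * 9          # megabits, at 9 bits per byte (assumed: 8 data + 1 check)
fit = 300                   # failures per 10^9 hours, per Mbit
hours = 176 * 24            # 176 days of uptime

errors = mbits * fit * 1e-9 * hours
print(f"estimated soft errors: {errors:.0f}")
```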
Are you willing to:
- Use the wrong data in a calculation?
- Execute an incorrect instruction?
- Use an invalid address or pointer?

The "cost" to prevent this by using ECC seems to include:
- Slower execution (you need to write the entire ECC word at one time; no 8-, 16-, or 32-bit writes)
- More expensive memory (it is typically an extra bit per byte)
- More expensive processors and systems (it takes circuitry to implement the ECC)
- A general increase in the quality of hardware construction (if you are going to sell a system that supports ECC, you need to appeal to customers who will buy it)
Most PC ECC systems I looked at around the year 2000 would also catch all double-bit errors and all errors in a single nibble. I would rather have a process or system stop if an error is encountered. If they are computing the syndrome over 128+ bits it could be even better. IBM mainframe systems at the time corrected all double-bit errors.
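The single-error-correct / double-error-detect behavior described above can be illustrated with a toy Hamming(12,8) code plus an overall parity bit. This is an illustrative sketch only, not any real controller's code (real memory ECC typically computes the syndrome over 64 or 128 data bits):

```python
PARITY_POS = [1, 2, 4, 8]   # check-bit positions (powers of two)
N = 12                      # Hamming word length: 8 data + 4 check bits

def encode(data):
    """Encode 8 data bits into 13 bits: overall parity + Hamming(12,8)."""
    code = [0] * (N + 1)                 # 1-indexed Hamming word
    j = 0
    for i in range(1, N + 1):            # data bits fill non-power-of-two slots
        if i not in PARITY_POS:
            code[i] = data[j]; j += 1
    for p in PARITY_POS:                 # check bit p covers positions with bit p set
        parity = 0
        for i in range(1, N + 1):
            if (i & p) and i != p:
                parity ^= code[i]
        code[p] = parity
    overall = 0                          # extended parity over the whole word
    for i in range(1, N + 1):
        overall ^= code[i]
    return [overall] + code[1:]

def decode(word):
    """Return (data, status); corrects single-bit errors, flags double-bit."""
    overall, code = word[0], [0] + word[1:]
    syndrome = 0
    for i in range(1, N + 1):            # XOR of positions of set bits =
        if code[i]:                      # position of a single flipped bit
            syndrome ^= i
    recomputed = 0
    for i in range(1, N + 1):
        recomputed ^= code[i]
    parity_ok = (recomputed == overall)
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif syndrome and not parity_ok:     # one error: correctable
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome and parity_ok:         # two errors: detectable, not correctable
        status = "double-bit error"
    else:                                # the overall parity bit itself flipped
        status = "parity-bit error"
    data = [code[i] for i in range(1, N + 1) if i not in PARITY_POS]
    return data, status
```

Flipping one bit of an encoded word gets corrected; flipping a second bit is detected but, as noted above, can no longer be corrected.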
Does Linux, by default, log ECC errors? If so, where? If not, how can logging be turned on?
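One place to look is the kernel's EDAC subsystem, which (on kernels where it is present and the right edac module is loaded) exposes corrected/uncorrected counts under sysfs. A sketch, assuming the usual ce_count/ue_count layout under /sys/devices/system/edac/mc/:

```python
import glob
import os

def edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {counter: value}} for corrected (ce_count) and
    uncorrected (ue_count) ECC error counters, or {} if EDAC is absent."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(base, "mc*"))):
        for name in ("ce_count", "ue_count"):
            path = os.path.join(mc, name)
            if os.path.exists(path):
                with open(path) as f:
                    counts.setdefault(os.path.basename(mc), {})[name] = int(f.read())
    return counts

print(edac_counts() or "EDAC not present (no ECC hardware, or edac modules not loaded)")
```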
- Louis
On Thu, Mar 14, 2013 at 12:42 AM, Bexley Hall <bexley401 at yahoo.com> wrote:
Hi Harry,
On 3/13/2013 5:42 PM, Harry McGregor wrote:
I would have two issues with a laptop as a server (and yes, I have done
it in the past myself).
Lack of ECC memory. <--- Memory errors scare me enough that I try and
use ECC even on desktop/workstation level systems
What sorts of error rates do you encounter (vs time vs array size)?
And, more importantly, what *policies* do you have in place to
deal with the appearance of correctable and uncorrectable errors?
I've never deployed ECC in an embedded system. Primarily, because
RAM requirements have never been high and because RAM == DATA
(not TEXT! Think: XIP). I.e., if you assume your code is
fetched correctly (if the memory error is a consequence of a
failure of the actual memory *interface*, then all bets are
off, regardless of ECC!), then *it* can validate the data on
which it is operating.
The automation/multimedia system uses the most "writable" memory
of any system I've designed, to date. And, dynamically loads
code so now RAM is not *just* DATA but TEXT as well! (this is
slightly inaccurate but not important enough to clarify).
I've been planning on ~1 FIT / MB / year as a rough goal. So,
an error or two per day is a conservative upper bound. [Hard to
get data on the types of devices I use so this is just a SWAG]
I assume errors are *hard*/repeatable. So, "correcting" the error
doesn't really buy anything -- it means that "location" now has
no error *correction* capability left: since one bit already requires
correction, any *other* bit exhibiting a failure *can't* be
corrected (unless I used a larger syndrome).
As such, I favor parity over ECC (especially as ECC severely limits
the implementation choices available to me -- parity can be "bolted
on"... sometimes :> ).
I count on invariants sprinkled liberally throughout my code to identify
likely "bad data". But, since most of my applications are effectively
periodic tasks, I can restart them when such a problem manifests
(and hope for the best).
Runtime diagnostics (e.g., in this case, memory scrubbers) try to
identify persistent failures so I can mark that section of memory
as "not to be used" and take it out of the pool.
I figure the bottom line to the *user* is to complain when
self-diagnosed reliability falls to a point where I have little
faith in being able to continue operating as intended. Effectively
failing a POST after-the-fact. At which point, the device in
question will need to be replaced (cheaper than supporting
replaceable memory).
I imagine the same sort of approach is used in server farms?
I.e., when reported corrected errors exceed some threshold, the
device in question (a DIMM, in that case) is replaced? And, the
server "noticed" as potentially prone to future problems?
(e.g., if the memory errors are related to something in the
DIMM's "operating environment").
Or, is this done on a more ad hoc basis? Is there ever a post
mortem performed on the suspected failed devices? Or, are they
just considered as "consumables"?
Thx!
--don
P.S. The sandwich place has proven to be a big hit with the
folks to whom I've suggested it! Thanks!
_______________________________________________
Tucson Free Unix Group - tfug at tfug.org
Subscription Options:
http://www.tfug.org/mailman/listinfo/tfug_tfug.org