[Tfug] ECC (was Re: Using a Laptop as a server)
keith smith
klsmith2020 at yahoo.com
Fri Mar 15 19:35:06 MST 2013
All this talk about ECC has got me thinking. I don't think I have ever used a system that has ECC. Maybe the computers at the UofA computer center in 1983. Maybe the Novell servers circa 1995.
Today? Not sure. I do light management on three web servers. I would guess they do not have ECC memory. I'm going to check.
There is that error once in a while that I cannot explain, however they are few and far between.
------------------------
Keith Smith
--- On Thu, 3/14/13, Louis Taber <ltaber at gmail.com> wrote:
From: Louis Taber <ltaber at gmail.com>
Subject: Re: [Tfug] ECC (was Re: Using a Laptop as a server)
To: "Tucson Free Unix Group" <tfug at tfug.org>
Date: Thursday, March 14, 2013, 7:44 AM
Hi All,
Using the data from the 2004 paper "Soft Errors in Electronic Memory – A White Paper" at http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf, and my system with 16 GBytes of RAM up for 176 days, I computed an estimate of 187 soft errors: 2^(10+4) * 9 * (10^-9) * (176*24) * 300.
This rests on two seriously invalid assumptions: 1) I am using all of the RAM, and 2) it is being used at full speed. The other assumption was a FIT rate of 300 (FIT, Failures In Time: errors per billion (10^9) hours of use, usually reported as FIT per Mbit).
The text suggested FIT rates of a few hundred to a few thousand.
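Spelled out, the arithmetic looks like this (a sketch; my reading of the constants — 2^(10+4) as 16 GBytes expressed in MBytes, and the 9 as bits per byte counting a check bit — is an assumption, since the formula above doesn't label its factors):

```python
# Reproduce the back-of-the-envelope soft-error estimate above.
mbytes = 2 ** (10 + 4)      # 16 GBytes = 16 * 1024 MBytes (assumed reading)
mbits = mbytes * 9          # megabits, at 9 bits per byte (assumed: 8 data + 1 check)
fit = 300                   # failures per 10^9 hours, per Mbit
hours = 176 * 24            # 176 days of uptime

errors = mbits * fit * 1e-9 * hours
print(f"estimated soft errors: {errors:.0f}")
```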
Are you willing to:
- Use the wrong data in a calculation?
- Execute an incorrect instruction?
- Use an invalid address or pointer?

The "cost" to prevent this by using ECC seems to include:
- Slower execution (you need to write the entire ECC word at one time; no 8-, 16-, or 32-bit writes)
- More expensive memory (it is typically an extra bit per byte)
- More expensive processors and systems (it takes circuitry to implement the ECC)
- A general increase in the quality of hardware construction (if you are going to sell a system that supports ECC, you need to appeal to customers who will buy it)
Most PC ECC systems I looked at around the year 2000 would also catch all double-bit errors and all errors in a single nibble. I would rather have a process or system stop if an error is encountered. If they are computing the syndrome over 128+ bits it could be even better. IBM mainframe systems at the time corrected all double-bit errors.
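The single-error-correct / double-error-detect behavior described above can be illustrated with a toy Hamming(12,8) code plus an overall parity bit. This is an illustrative sketch only, not any real controller's code (real memory ECC typically computes the syndrome over 64 or 128 data bits):

```python
PARITY_POS = [1, 2, 4, 8]   # check-bit positions (powers of two)
N = 12                      # Hamming word length: 8 data + 4 check bits

def encode(data):
    """Encode 8 data bits into 13 bits: overall parity + Hamming(12,8)."""
    code = [0] * (N + 1)                 # 1-indexed Hamming word
    j = 0
    for i in range(1, N + 1):            # data bits fill non-power-of-two slots
        if i not in PARITY_POS:
            code[i] = data[j]; j += 1
    for p in PARITY_POS:                 # check bit p covers positions with bit p set
        parity = 0
        for i in range(1, N + 1):
            if (i & p) and i != p:
                parity ^= code[i]
        code[p] = parity
    overall = 0                          # extended parity over the whole word
    for i in range(1, N + 1):
        overall ^= code[i]
    return [overall] + code[1:]

def decode(word):
    """Return (data, status); corrects single-bit errors, flags double-bit."""
    overall, code = word[0], [0] + word[1:]
    syndrome = 0
    for i in range(1, N + 1):            # XOR of positions of set bits =
        if code[i]:                      # position of a single flipped bit
            syndrome ^= i
    recomputed = 0
    for i in range(1, N + 1):
        recomputed ^= code[i]
    parity_ok = (recomputed == overall)
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif syndrome and not parity_ok:     # one error: correctable
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome and parity_ok:         # two errors: detectable, not correctable
        status = "double-bit error"
    else:                                # the overall parity bit itself flipped
        status = "parity-bit error"
    data = [code[i] for i in range(1, N + 1) if i not in PARITY_POS]
    return data, status
```

Flipping one bit of an encoded word gets corrected; flipping a second bit is detected but, as noted above, can no longer be corrected.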
Does Linux, by default, log ECC errors? If so, where? If not, how can logging be turned on?
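One place to look is the kernel's EDAC subsystem, which (on kernels where it is present and the right edac module is loaded) exposes corrected/uncorrected counts under sysfs. A sketch, assuming the usual ce_count/ue_count layout under /sys/devices/system/edac/mc/:

```python
import glob
import os

def edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {counter: value}} for corrected (ce_count) and
    uncorrected (ue_count) ECC error counters, or {} if EDAC is absent."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(base, "mc*"))):
        for name in ("ce_count", "ue_count"):
            path = os.path.join(mc, name)
            if os.path.exists(path):
                with open(path) as f:
                    counts.setdefault(os.path.basename(mc), {})[name] = int(f.read())
    return counts

print(edac_counts() or "EDAC not present (no ECC hardware, or edac modules not loaded)")
```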
- Louis
On Thu, Mar 14, 2013 at 12:42 AM, Bexley Hall <bexley401 at yahoo.com> wrote:
Hi Harry,
On 3/13/2013 5:42 PM, Harry McGregor wrote:
I would have two issues with a laptop as a server (and yes, I have done
it in the past myself).
Lack of ECC memory. <--- Memory errors scare me enough that I try and
use ECC even on desktop/workstation level systems
What sorts of error rates do you encounter (vs time vs array size)?
And, more importantly, what *policies* do you have in place to
deal with the appearance of correctable and uncorrectable errors?
I've never deployed ECC in an embedded system. Primarily, because
RAM requirements have never been high and because RAM == DATA
(not TEXT! Think: XIP). I.e., if you assume your code is
fetched correctly (if the memory error is a consequence of a
failure of the actual memory *interface*, then all bets are
off, regardless of ECC!), then *it* can validate the data on
which it is operating.
The automation/multimedia system uses the most "writable" memory
of any system I've designed, to date. And, dynamically loads
code so now RAM is not *just* DATA but TEXT as well! (this is
slightly inaccurate but not important enough to clarify).
I've been planning on ~1 FIT / MB / year as a rough goal. So,
an error or two per day is a conservative upper bound. [Hard to
get data on the types of devices I use so this is just a SWAG]
I assume errors are *hard*/repeatable. So, "correcting" the error
doesn't really buy anything -- it means that "location" now has
no error *correction* capability left: since one bit already requires
correction, any *other* bit exhibiting a failure *can't* be
corrected (unless I used a larger syndrome).
As such, I favor parity over ECC (especially as ECC severely limits
the implementation choices available to me -- parity can be "bolted
on"... sometimes :> ).
I count on invariants sprinkled liberally throughout my code to identify
likely "bad data". But, since most of my applications are effectively
periodic tasks, I can restart them when such a problem manifests
(and hope for the best).
Runtime diagnostics (e.g., in this case, memory scrubbers) try to
identify persistent failures so I can mark that section of memory
as "not to be used" and take it out of the pool.
I figure the bottom line to the *user* is to complain when
self-diagnosed reliability falls to a point where I have little
faith in being able to continue operating as intended. Effectively
failing a POST after-the-fact. At which point, the device in
question will need to be replaced (cheaper than supporting
replaceable memory).
I imagine the same sort of approach is used in server farms?
I.e., when reported corrected errors exceed some threshold, the
device in question (a DIMM, in that case) is replaced? And, the
server "noticed" as potentially prone to future problems?
(e.g., if the memory errors are related to something in the
DIMM's "operating environment").
Or, is this done on a more ad hoc basis? Is there ever a post
mortem performed on the suspected failed devices? Or, are they
just considered as "consumables"?
Thx!
--don
P.S. The sandwich place has proven to be a big hit with the
folks to whom I've suggested it! Thanks!
_______________________________________________
Tucson Free Unix Group - tfug at tfug.org
Subscription Options:
http://www.tfug.org/mailman/listinfo/tfug_tfug.org