[Tfug] "Downgrading" ("underclocking?") processors
John Hubbard
ender8282 at yahoo.com
Thu Feb 20 18:23:19 MST 2014
On 02/20/2014 03:09 AM, Bexley Hall wrote:
> Hi John,
>
> Sheesh! I'll be juicing oranges for a *month*! :<
>
> On 2/19/2014 9:44 PM, John Hubbard wrote:
>> On 02/19/2014 09:29 PM, Bexley Hall wrote:
>>> Hi John,
>>>
>>> My goal is to have a design that doesn't crap out because something
>>> blocked an intake (fur balls/dust bunnies clogging an input filter)
>>> or a fan gave up the ghost, etc. (come home to find CPU has *melted*
>>> and everything it was expected to do in your absence didn't get done!)
>>
>> Modern systems will throttle performance when things get too hot.
>
> What *exactly* are you claiming? "System" is a vague term. :>
By system I meant a modern Intel/AMD processor, motherboard, and memory.
> Can I take a box off the shelf, write ANY SOFTWARE I WANT to run
> on that BARE METAL and be assured that the machine will protect
> itself *and* guarantee a specific level of performance (if so,
> what EXACTLY is that?) regardless of temperature?
If it is a modern Intel/AMD based system: yes it will protect itself. I
believe that all of the protection happens at the BIOS level or below.
>
> IME, things like disks *may* spin down -- but won't automagically
> spin back *up* (i.e., "system software" needs to be aware of this
> characteristic of this drive and *know* how and when to try to
> spin it up).
>
> This should be easy to test! Get a sacrificial system. Write some
> code to do something that ensures a steady workload (write to
> pseudorandom memory addresses to keep the cache cold; push pseudorandom
> buffers of data onto disk; send a sequence number out a serial port;
> repeat -- forever). Then, tape the airholes closed and let it sit for:
> - 10 hours (a normal "work day" while you're away from home)
> - 48 hours (a "weekend away")
> - 72 hours (a three-day weekend)
> - 168 hours (a one week vacation)
> - 336 hours (two weeks away)
> and see:
> - *if* it is still running at the end of that interval
> - if it has continued working at the same "rate" over that interval
> - if cycling power allows it to recover
>
> Hmmm... I think UofA auction was yesterday. So, I'll have to wait
> two weeks before I'll have a chance to find something "disposable"
> to experiment on. Or, maybe I'll try WC to see if they have a
> couple of "scrap" machines that I can toast.
>
> See what happens when machine sits idle with no ventilation.
> Running a full workload on bare metal.
> Running a "modern OS" (Windows/Linux/*BSD) with the same full workload.
> Same experiments with fans unplugged (system should be able to *sense*
> this BEFORE it ever starts to heat up! What will it do to protect
> itself?)
>
> Then, figure out what constraints this imposes on the choice of
> components that can be stuffed *in* the case. It may be that
> the server-side of this project is just not well suited to
> an "open" solution. Maybe just let folks design their own
> "motes" and "applets" and keep the server's design more "controlled".
What is the level of guarantee needed? Who cares if you don't run the
user applets and what not? (See more below.)
>
>> Something would have to screw up pretty badly for the machine to melt or
>> even damage itself. Generally you'd just see performance go down.
>
> So, what level does it fall to? Where do you "look up" that detail?
> If you can only count on 80% of "normal", then why not set the CPU to
> run at 80% of normal and design the entire system to operate under
> those conditions -- because it *has* to guarantee that all the
> intended work actually gets done!
>
> Or, do you come up with some scheme for prioritizing which activities
> can be shed?
>
> "Hmmm... maybe I shouldn't worry about monitoring for intruders as its
> probably more important to ensure the temperature inside the building
> stays comfortable for the pets/plants/etc? Or, maybe skip watering
> the yard in the hope that it rains while I concentrate on watching
> for burglars? Or, ..."
Or you just give up. Have you done a hazard analysis to understand what
happens if something fails? Which failures are acceptable? If someone
is trusting their $1,000,000,000,000 house to a single computer then they
are asking for trouble. What if the power goes out? Or the water
main bursts and the computer is underwater? The machine throttling and/or
shutting down is just another of these failures. If you really need to
guarantee that it works then you need a second computer performing the
same calculations, and then a check to make sure that both machines got
the same result.
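A rough sketch of that cross-check, in Python. The names (cross_check,
setpoint, faulty_setpoint) and the heater calculation are made up for
illustration. In a real setup the two computations would run on separate
machines, not as two functions in one process.

```python
from typing import Callable, Optional


def cross_check(compute_a: Callable[[float], float],
                compute_b: Callable[[float], float],
                reading: float,
                tolerance: float = 0.0) -> Optional[float]:
    """Return the agreed result, or None if the two channels disagree."""
    a = compute_a(reading)
    b = compute_b(reading)
    if abs(a - b) <= tolerance:
        return a
    # Disagreement: we can't tell which channel is wrong, so trust neither.
    return None


def setpoint(temp_c: float) -> float:
    # Hypothetical control calculation: heater duty cycle from temperature.
    return max(0.0, min(1.0, (22.0 - temp_c) / 10.0))


def faulty_setpoint(temp_c: float) -> float:
    # Simulates a machine that has gone wrong.
    return setpoint(temp_c) + 0.5
```

Note that two machines only let you detect a disagreement. Deciding which
one is right takes a third machine and a vote.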
It sounds like the problem is that the system doesn't fail safely. I'm
not sure that it needs to but you are talking like it does. If being
unable to ensure the temperature stays comfortable is a 'serious'
problem, then you need to evaluate how to guarantee that any problem
(e.g. power loss) doesn't cripple the system.
At my current job we are using Allen-Bradley PLCs in our Safety System.
In short the system is aware of all the other pieces and if it doesn't
get a heartbeat from the kill button saying "I'm not pressed" then it
kills things.[1] If the amplifiers aren't receiving their enable signal
from the Interlock Controller, they won't move, and the brakes will be
de-energized (i.e. clamped shut preventing movement). It's a PITA but
according to MIL-STD-882 it means we are 'safe'.
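The heartbeat idea can be sketched in a few lines of Python. This is only
an illustration of the principle, not how the PLCs are actually
programmed; the Interlock class, its method names, and the timeout value
are all assumptions of mine.

```python
import time


class Interlock:
    """Enable is granted only while fresh "not pressed" heartbeats arrive."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_ok = None  # no heartbeat seen yet, so start disabled

    def heartbeat(self, pressed: bool) -> None:
        if pressed:
            self.last_ok = None  # kill button pressed: drop enable now
        else:
            self.last_ok = self.clock()

    def enabled(self) -> bool:
        if self.last_ok is None:
            return False
        # A late or missing heartbeat looks the same as a pressed button.
        return (self.clock() - self.last_ok) <= self.timeout_s
```

The point of the design is that silence means "stop". Anything that
delays the heartbeat (a cut wire, a dead sender, a congested network)
removes the enable, so the failure lands on the safe side.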
In your case, I think that the answer is that if the pet will get
uncomfortable after more than 4 hours with the system offline then someone
needs to physically go to the premises and make sure that things are OK
every 4 hours. If the plants will die after 3 days of no water then
someone needs to physically inspect them every ~2 days. When we had a
water main burst at my office over a long weekend, it was the 'security
guard' who caught it, and prevented the basement from flooding. We have
a security guard instead of a bunch of sensors, for exactly that
reason. The cost of the guard is a lot less than the value of the
stuff being 'guarded'. If the pets and plants are that important then
using /just/ a typical computer to safeguard them is a bad idea.
[1] It became particularly apparent that we had a network configuration
problem when my scp'ing a file off of one machine brought everything to
its knees. Turns out that my machine and the machine I was talking to
were doing so over the PLC/Safety network. The large file that I was
copying delayed the safety signals and things shut down. Good news is
that everything was safe.
--
-john
To be or not to be, that is the question
2b || !2b
(0b10)*(0b1100010) || !(0b10)*(0b1100010)
0b11000100 || !0b11000100
0b11000100 || 0b00111011
0b11111111
255, that is the answer.