[Tfug] CPU Query
Bexley Hall
bexley401 at yahoo.com
Wed Apr 4 11:56:25 MST 2007
--- Jim Secan <jim at nwra.com> wrote:
> > I assume you mean to imply that your tasks are
> > "compute-bound" and not I/O-bound?  Do you have
> > enough awareness of what the actual algorithms
> > entail (e.g., fixed point vs. floating point, etc.)
>
> I wrote all the code, so I know exactly what it's
> doing (OK, so I didn't
Well, at least you *hope* you know! ;-)
> write the SVD package, but that's from one of the
> optimized libraries).
> I/O has all been optimized such that you read it
> all in (binary
Actually, this is one place where two processors
could have saved you something -- since the I/O
can happen concurrently with the processing
(assuming there is a LOT of I/O...).
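To make that concrete, here's a minimal sketch (C,
pthreads) of double-buffering the input on a second
thread.  read_chunk()/crunch_chunk() are made-up
stand-ins for your binary read and your number
crunching, so treat it as an illustration rather
than a drop-in:

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>

    #define NCHUNKS 100
    #define CHUNK   (1 << 18)          /* doubles per chunk */

    static double bufs[2][CHUNK];
    static pthread_barrier_t handoff;

    /* stand-in for the binary read */
    static void read_chunk(double *dst)
    {
        for (int i = 0; i < CHUNK; i++)
            dst[i] = i;
    }

    /* stand-in for the matrix crunching */
    static void crunch_chunk(const double *src)
    {
        volatile double s = 0;
        for (int i = 0; i < CHUNK; i++)
            s += src[i];
    }

    static void *reader(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NCHUNKS; i++) {
            read_chunk(bufs[i % 2]);         /* fill buffer i...        */
            pthread_barrier_wait(&handoff);  /* ...then hand it to main */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_barrier_init(&handoff, NULL, 2);
        pthread_create(&t, NULL, reader, NULL);
        for (int i = 0; i < NCHUNKS; i++) {
            pthread_barrier_wait(&handoff);  /* wait for buffer i to fill  */
            crunch_chunk(bufs[i % 2]);       /* crunch it while the reader
                                                is already filling i+1     */
        }
        pthread_join(t, NULL);
        puts("done");
        return 0;
    }

(Compile with something like "gcc -std=c99 -O2 -pthread".)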
> unformatted), crunch numbers, and then write it all
> out. The "bind" is in
> floating-point operations (mostly matrix
> manipulations - this is a largish
Yes, floating point is almost always a pig.
But the time required to *accurately* do away
with floating point in favor of fixed point math
is rarely worth spending on "run once" applications
(by that, I mean anything that doesn't run
frequently -- for sufficiently large values of
"frequently"...)
> inverse problem). My interest is in whether the OS
> can take advantage of
> the 2X CPUs without my having to get a compiler
> (Fortran) that will do
Recompiling with a modern (?) compiler would be
an inexpensive first step.  I assume you have
taken care to look at just how you access the
matrices so as not to defeat the effectiveness
of the D-cache on the machine?  E.g.,
    for (r = 0; r < ROWMAX; r++)
        for (c = 0; c < COLMAX; c++)
            matrix(r,c) = F(r,c);

behaves quite differently from:

    for (c = 0; c < COLMAX; c++)
        for (r = 0; r < ROWMAX; r++)
            matrix(r,c) = F(r,c);
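If you want to see how much the ordering matters on
a given box, here's a throwaway C toy you could
compile (gcc -std=c99 -O2) and time.  C is
row-major, so the first sweep below walks memory
contiguously; Fortran is column-major, so the roles
flip there:

    #include <stdio.h>
    #include <time.h>

    #define N 2048
    double m[N][N];                    /* ~32 MB of doubles */

    int main(void)
    {
        clock_t t0, t1, t2;

        t0 = clock();
        for (int r = 0; r < N; r++)    /* row-major: contiguous accesses */
            for (int c = 0; c < N; c++)
                m[r][c] = r + c;

        t1 = clock();
        for (int c = 0; c < N; c++)    /* strided by a whole row each    */
            for (int r = 0; r < N; r++)/* step: cache-hostile            */
                m[r][c] = r + c;

        t2 = clock();
        printf("row sweep:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column sweep: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }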
Also, make sure you have enough physical memory
to avoid any paging as this would quickly lead to
thrashing when manipulating a large matrix.
> this. Either that, or get into the manual "loop
> unrolling" business,
> which loses me more (in my time) than I would gain.
> I want to know if
> paying a little more for a 2X CPU will gain me in
> throughput without my
> having to do anything other than copy codes over
> from my current FC3
"Measure, then optimize". Why not try running the
code on a small data set and time it? I wonder
if the multicore boxes do anything notably different
than a multiCPU box? I.e., perhaps find a generic
2 CPU box, run the code. Pull one of the CPU's and
run it again? If there was a notable decrease in
performance, I would be encouraged. Unfortunately,
if there was *minimal* difference, I wouldn't
conclude anything from it (since there may be
differences brought about by the fact that dual CPU
designs have to bring everything out through the pad
drivers (significant delay) while a dual *core*
can skip this...
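On the "time it" front, one cheap check is whether a
single run can even use a second processor: compare
wall-clock time to CPU time.  If the job is a single
thread, user+sys will track the wall time and a
second CPU can't shorten the run by itself.  A
minimal sketch (run_the_job() is a made-up stand-in
for your real workload):

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    static void run_the_job(void)      /* stand-in for the real crunching */
    {
        volatile double s = 0;
        for (long i = 0; i < 100000000L; i++)
            s += (double)i * 1e-9;
    }

    int main(void)
    {
        struct timeval w0, w1;
        struct rusage  ru;

        gettimeofday(&w0, NULL);
        run_the_job();
        gettimeofday(&w1, NULL);
        getrusage(RUSAGE_SELF, &ru);

        double wall = (w1.tv_sec - w0.tv_sec)
                    + (w1.tv_usec - w0.tv_usec) / 1e6;
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

        printf("wall %.2f s, user %.2f s, sys %.2f s\n",
               wall, user, sys);
        return 0;
    }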
> system and go. As a related side issue, I could
> care less about video
> performance - I work at the command line and could
> live with this on a TTY
> user interface.
>
> I have heard that some OS's (distros) will do a sort
> of load-leveling, but
> I don't know what sort of gain this would provide
> for a single process. I
Most of the "system" time spent on a *workstation*
(as distinct from a "server") is negligible.
Running the network stack can eat up some resources,
but I suspect you aren't *moving* any data across
the wire, so that would be negligible.  Likewise,
from your description of your implementation, any
threads servicing I/O would be minimal.
> have doubts about that, and that's why I'm asking.
> I certainly don't want
> to find that I pay more for a 2X only to find that
> my processing runs
> slower than a comparable speed (and cheaper) 1X
Exactly. This is the Windows model... :>
> because I gain nothing
> from the second processor and lose from extra things
> the OS is doing
> because it knows it has more than one processing
> path through the CPU. I
> have seen this sort of thing happen to people trying
> to parallelize or
> vectorize their codes. Definitely a YMMV situation
> (and possibly also a
> TANSTAAFL situation RE gain without pain).
My approach (assuming the code/system is portable):
try running the code on the sort of box(es) you
are looking at and time them.  If you don't see
at least a 25% improvement, your better bet is just
to wait 3 months for a faster 1X CPU to become
available.
*If* you have a multiCPU box handy, perhaps try the
"pull one CPU" trick and see if it gives you any
marked difference. (it would be interesting to
*know*...)
How long can you afford to wait? :> I.e., I tend
to operate on the adage "do nothing unless you
can realize a 2X performance increase -- cuz the
time it takes to implement/debug/validate can better
be spent WAITING for faster hardware..."
HTH,
--don
P.S. *I* would appreciate hearing how this shakes
out for you...