[Tfug] Communications reliability
Bexley Hall
bexley401 at yahoo.com
Mon Dec 31 10:25:38 MST 2007
Hi,
I'm working on a distributed application. I
have done this sort of thing before, *but*,
always in an environment where the interconnect
wiring was considered a *crucial* element of
the system. I.e., playing a role similar to
the data bus between a CPU and its memory
(in that you *assume* the integrity of that).
If a cable was cut or unplugged, or an
interface died, it was *proper* for the
system to "stop
working" (though in an orderly, controlled
fashion).
Now, however, I am dealing with a scenario in
which nodes may become disconnected accidentally
or *intentionally* and the "system" must cope
with their lack of availability. I.e., the
features/facilities/capabilities that they
represent/implement are no longer accessible
but the application itself must still continue
to run.
In a very loose sense, this is similar to how
The Internet works: if a site is "down", then
you (e.g., a web browser) just can't access the
assets of that site -- but the rest of The
Internet is still accessible.
However (building on that web browser example),
the web browser <-> user interface is really
quite crude in that instance: "site unavailable".
Worse yet, this hides a plethora of problems that
might be the cause (i.e., the site might be "up"
but a router between here and there might be
having difficulties).
Note that browsers (and other network clients
on which they rely) *tend* to silently make
several repeated attempts to achieve their
goal, usually relying on something as crude
as a time-out to determine when to "fail".
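For concreteness, the usual refinement is to
retry with an *increasing* timeout rather than
a single fixed one. A minimal Python sketch
(host/port and the tuning constants are
placeholder assumptions):

    import socket, time

    def try_connect(host, port, attempts=4, timeout=1.0, backoff=2.0):
        # Retry with a growing timeout so a momentarily
        # slow service isn't misclassified as "down".
        for _ in range(attempts):
            try:
                socket.create_connection((host, port), timeout).close()
                return True           # reachable
            except OSError:
                time.sleep(timeout)   # brief pause before retrying
                timeout *= backoff    # be more patient each time
        return False                  # all attempts exhausted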
I'm looking for suggestions as to how a "node"
(client *or* server) might more intelligently
do this.
E.g., when *I* encounter a problem with a
web site, I fall back on other tools to try
to determine if there is a *real* problem
(connectivity, etc.) or if this is perhaps
just a temporary overload of some resource
(e.g., network bandwidth, the site's available
computing power, etc.). This tells me:
- if there is a "real" problem (vs. having a
time-out that is presently "too quick")
- where that problem might be
- how likely I am to be able to complete my
request "if I am persistent"
Note that, to some extent, we all do this.
Some folks might hammer away at a site (resource)
until they "get through". Others might quickly
shift their attention to an alternate site that
might suffice (e.g., try the next "result" in
a search if the present one is "not answering").
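A machine can mimic both behaviors: hammer
briefly at one resource, then shift attention
to an equivalent alternate. A sketch (the
candidate list is hypothetical):

    import socket

    def first_available(candidates, tries_each=2, timeout=2.0):
        # Try each candidate a bounded number of times,
        # then move on -- like trying the next "result".
        for host, port in candidates:
            for _ in range(tries_each):
                try:
                    socket.create_connection((host, port), timeout).close()
                    return (host, port)
                except OSError:
                    pass
        return None   # nothing answered

    # e.g.: first_available([("mirror1.example", 80),
    #                        ("mirror2.example", 80)])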
This gets a bit trickier if a machine has to
make these decisions. :< Hence, my search for
responsible algorithms to package this.
E.g., a simple ping can tell you that connectivity
exists and the target *appears* to be "up". So,
if some other service is not answering, it is
likely that the problem lies in that service...
I can be a bit smarter than this, of course (e.g.,
look at what other traffic is running on that I/F
and build a tiny expert system to tell me whether
the I/F is at fault, the network, the targeted node,
etc.).
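The skeleton of that "expert system" is just a
decision ladder that walks outward from the
local machine. A sketch (shelling out to the
system ping(8); the -c/-W flags are the Linux
ones, and gateway/target/port are placeholders):

    import socket, subprocess

    def ping(host):
        # One ICMP echo via the system ping(8) utility.
        return subprocess.call(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL) == 0

    def diagnose(gateway, target, port):
        # Walk outward: local LAN, remote host, service.
        if not ping(gateway):
            return "local trouble: my I/F or the LAN"
        if not ping(target):
            return "path or target node down"
        try:
            socket.create_connection((target, port), 3.0).close()
        except OSError:
            return "node up, service not answering"
        return "all layers look healthy"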
Note that I can access lots of things (data)
that aren't typically checked in the above example
(e.g., I can verify that carrier is present on
an interface to ensure that *my* cable isn't
unplugged, etc.).
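On Linux, for instance, carrier state is
exported under /sys, so that check is a
one-line read (interface name is a placeholder;
other OSes need ioctls or their own tools):

    def carrier_up(iface="eth0"):
        # /sys/class/net/<iface>/carrier reads "1" when
        # link is present; Linux-specific, and the read
        # raises OSError if the interface is
        # administratively down.
        try:
            with open("/sys/class/net/%s/carrier" % iface) as f:
                return f.read().strip() == "1"
        except OSError:
            return False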
Any "more elegant" (or, "more practical") suggestions?
Thx,
--don