In a generally reliable network, machines are up most of the time. However, as soon as a required machine goes down or becomes unavailable or inaccessible, we need to be able to adapt or at least report the failure.
We will show how to set up an event to provide a measure of robustness.
In this example, we will assume that an event has failed (failing event) and we simply need to report on this failure (reporting event). The goal will be to ensure a high probability that the reporting event will succeed.
The only part of the failing event that we are concerned with is its failover_event setting, which we will direct to the reporting event.
The reporting event will:
- try to log the failure on one of 5 high availability machines
- fall back to localhost if the other machines are not available
- HOSTS is a :-separated list of hosts
- HOST selects the host using #HCRON_EVENT_CHAIN as an index into the host list
- command is set to update a log with a message tied to the original event that failed
- failover_event references itself ($HCRON_EVENT_NAME) so that the event is called repeatedly
- /report_failure stops iterating when it successfully launches
- localhost is used as a fallback in case no other machines are available
- template_name is set so that the event will never execute
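The iteration described above can be sketched in Python. This is only a simulation of the idea, not hcron syntax: the host names other than log1 and localhost, the `report_failure` function, and the `try_host` callback are assumptions made for illustration, and the `index` variable merely stands in for the index that the event derives from #HCRON_EVENT_CHAIN.

```python
from typing import Callable, Optional

# Hypothetical host list; only log1 and localhost are named in the text.
HOSTS = "log1:log2:log3:log4:log5:localhost"


def report_failure(hosts: str, try_host: Callable[[str], bool]) -> Optional[str]:
    """Walk the host list until one attempt succeeds.

    Each pass through the loop stands in for one re-invocation of the
    reporting event through its own failover_event.
    """
    host_list = hosts.split(":")
    index = 0  # stand-in for the chain-derived index
    while index < len(host_list):
        host = host_list[index]
        if try_host(host):
            return host  # success: no further failover is triggered
        index += 1  # failure: the event fails over to itself
    return None  # every host, including the fallback, failed


# Simulate log1 and log2 being down.
down = {"log1", "log2"}
print(report_failure(HOSTS, lambda h: h not in down))  # prints "log3"
```

Placing localhost last in the list is what makes it the fallback: it is only reached once every other host has been tried.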
One of the drawbacks of the above setup is that the host list is fixed. This means that log1 will always be hit first, so there is no load balancing. We can address this by using a DNS server that provides round-robin resolution of a hostname (e.g., log). Our event file could then be modified so that:
- log is attempted 3 times (regardless of which hosts it actually resolves to); the goal is simply to spread out the load, not to ensure that all hosts are tried
- we fall back on the fixed list as before to ensure that all hosts are tried
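The round-robin variant can be simulated the same way. The sketch below is an assumption-laden illustration, not an hcron event file: `make_resolver` fakes DNS round-robin by rotating through a pool, and the host names log2 through log5, `ATTEMPTS`, and `report_failure` are hypothetical.

```python
from itertools import cycle
from typing import Callable, List, Optional

# Hypothetical fixed list; only log1 and localhost are named in the text.
FIXED = ["log1", "log2", "log3", "log4", "log5", "localhost"]
# Try the round-robin name "log" three times, then the fixed list.
ATTEMPTS = ["log"] * 3 + FIXED


def make_resolver(pool: List[str]) -> Callable[[str], str]:
    """Fake DNS round-robin: each lookup of "log" yields the next pool host."""
    rotation = cycle(pool)

    def resolve(name: str) -> str:
        return next(rotation) if name == "log" else name

    return resolve


def report_failure(attempts: List[str],
                   resolve: Callable[[str], str],
                   is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first resolved host that accepts the report, if any."""
    for name in attempts:
        host = resolve(name)
        if is_up(host):
            return host
    return None


down = {"log1", "log2", "log3"}
resolver = make_resolver(FIXED[:-1])  # localhost is not part of the DNS pool
print(report_failure(ATTEMPTS, resolver, lambda h: h not in down))  # prints "log4"
```

In this run the three round-robin attempts happen to land on hosts that are down, so the fixed list is what guarantees that log4 is eventually reached.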
The example uses event chaining, but in a slightly different way than might be expected. Rather than chaining events on success (via next_event), we are chaining using failover_event until there is no failure.
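The difference between the two kinds of chaining can be made concrete with a small dispatcher sketch. It is a toy model rather than hcron's implementation: the `Event` class, `run_chain`, the `/failing` event name, and the `limit` safety cap are all assumptions introduced here, and the chain length passed to each action only loosely mirrors what #HCRON_EVENT_CHAIN provides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    action: Callable[[int], bool]         # receives chain length, returns success
    next_event: Optional[str] = None      # followed on success
    failover_event: Optional[str] = None  # followed on failure


def run_chain(events: Dict[str, Event], start: str, limit: int = 10) -> List[str]:
    """Follow next_event on success and failover_event on failure."""
    chain: List[str] = []
    name: Optional[str] = start
    while name is not None and len(chain) < limit:  # limit is a sketch-only guard
        event = events[name]
        ok = event.action(len(chain))
        chain.append(name)
        name = event.next_event if ok else event.failover_event
    return chain


events = {
    # The original event always fails and hands off to the reporter.
    "/failing": Event(action=lambda n: False, failover_event="/report_failure"),
    # The reporter fails over to itself until the chain is long enough.
    "/report_failure": Event(action=lambda n: n >= 3,
                             failover_event="/report_failure"),
}
print(run_chain(events, "/failing"))
# prints ['/failing', '/report_failure', '/report_failure', '/report_failure']
```

Because /report_failure defines no next_event, the chain ends at the first successful attempt.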