Introduction

In a generally reliable network, machines are up most of the time. However, as soon as a required machine goes down or becomes unavailable or inaccessible, we need to be able to adapt, or at least report the failure.

We will show how to set up an event to provide a measure of robustness.

Example

In this example, we will assume that an event has failed (failing event) and we simply need to report on this failure (reporting event). The goal will be to ensure a high probability that the reporting event will succeed.

The only part of the failing event that concerns us is its failover_event setting, which we point at the reporting event (/report_failure):

...
failover_event=/report_failure

The reporting event will:

  • try to log the failure on one of 5 high availability machines
  • fall back to localhost if none of the other machines is available

/report_failure
HOSTS=:log1:log2:log3:log4:log5:localhost
HOST=$HOSTS[#HCRON_EVENT_CHAIN]
 
as_user=
host=$HOST
command=log "error" "hcron event failure on $HCRON_EVENT_CHAIN[0]"
notify_email=
notify_message=
when_month=*
when_day=*
when_hour=*
when_minute=*
when_dow=*
template_name=/report_failure
failover_event=$HCRON_EVENT_NAME

Notes:

  • HOSTS is a :-separated list of hosts
  • HOST selects the host using #HCRON_EVENT_CHAIN as an index into HOSTS
  • command is set to update a log with a message tied to the original event that failed ($HCRON_EVENT_CHAIN[0])
  • failover_event references itself ($HCRON_EVENT_NAME) to be called repeatedly
  • /report_failure stops iterating when it successfully launches
  • localhost is used as a fallback in case no other machines are available
  • template_name is set so that the event will never execute on its own schedule; it runs only when reached through failover_event
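
The selection and retry behaviour described in the notes above can be sketched in Python. This is only an illustration and not hcron's implementation: the names select_host, report_failure and is_up are hypothetical, and the sketch assumes that #HCRON_EVENT_CHAIN is 1 the first time /report_failure is reached and grows by one with each failover, so that the empty first field produced by the leading ":" is never selected.

```python
HOSTS = ":log1:log2:log3:log4:log5:localhost"


def select_host(hosts, chain_length):
    """Mimic HOST=$HOSTS[#HCRON_EVENT_CHAIN]: split on ':' and index by chain length."""
    fields = hosts.split(":")  # the leading ':' yields an empty field at index 0
    if chain_length < len(fields):
        return fields[chain_length]
    return None  # host list exhausted


def report_failure(hosts, is_up, first_chain_length=1):
    """Try hosts in order until one accepts the report; return (host, attempts)."""
    attempts = []
    chain_length = first_chain_length  # assumption: 1 on the first failover
    while True:
        host = select_host(hosts, chain_length)
        if host is None:
            return None, attempts
        attempts.append(host)
        if is_up(host):
            return host, attempts
        chain_length += 1  # failover_event re-invokes the same event


# Usage: log1 and log2 are down, so the report lands on log3.
down = {"log1", "log2"}
host, attempts = report_failure(HOSTS, lambda h: h not in down)
print(host, attempts)  # log3 ['log1', 'log2', 'log3']
```

If every log host is down, the same loop ends on localhost, which is why it is placed last in the list.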

One drawback of the above setup is that the host list is fixed: log1 is always tried first, so there is no load balancing. We can address this with a DNS name that resolves round-robin across the log hosts (e.g., log). Our event file could then be modified to contain:

HOSTS=:log:log:log:log1:log2:log3:log4:log5:localhost

Notes:

  • log is attempted 3 times (regardless of which hosts it actually resolves to); the goal is simply to spread out the load, not to ensure that all hosts are tried
  • we fall back on the fixed list as before to ensure that all hosts are tried
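
The order of attempts for the modified list can be sketched as follows. The resolver here is a hypothetical stand-in for round-robin DNS, and the pool of five hosts behind the name log is an assumption made for illustration; the sketch does not query a real name server.

```python
from itertools import cycle

HOSTS = ":log:log:log:log1:log2:log3:log4:log5:localhost"


def attempt_order(hosts, resolve):
    """Return (name, resolved host) pairs in the order the failover chain tries them."""
    names = hosts.split(":")[1:]  # skip the empty field created by the leading ':'
    return [(name, resolve(name)) for name in names]


# Hypothetical round-robin resolver: "log" rotates through the pool,
# every other name resolves to itself.
pool = cycle(["log1", "log2", "log3", "log4", "log5"])


def resolve(name):
    return next(pool) if name == "log" else name


order = attempt_order(HOSTS, resolve)
for name, target in order:
    print(name, "->", target)
```

In this sketch the rotation always starts at log1. With a real round-robin DNS the starting point varies between lookups, which is what spreads the load across the log hosts.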

Conclusion

The example uses event chaining, but in a slightly different way than might be expected. Rather than chaining events on success (via next_event), we chain on failure (via failover_event) until an attempt succeeds.