
Detect missed executions with OpenNMS

Everyone knows that OpenNMS is a powerful monitoring solution, but not everyone knows that since around version 1.10 it has embedded the Drools rule processing engine. Drools programs can be used to extend the event handling logic in new and powerful ways.

The following example shows how OpenNMS can be extended to detect missed executions for recurring activities like backups or scheduled jobs.

The core functionality is implemented in the following Drools program (commented below):

First we need to define (at least) two UEIs: uei.<yournamespace>/job/recurring/Warning and uei.<yournamespace>/job/recurring/Normal. The events must be configured so that a Normal event clears any previous Warning. At the moment I feed these events into OpenNMS using syslog, but I am planning to replace syslog with my sendevent web-hook.
Each event carries three additional params (visible in the screenshot above):

  1. job or backupset: carries the job or backupset name, because one host can execute multiple jobs. It must be part of the event reduction key so that warnings are resolved correctly
  2. every: the value 'every' marks an externally submitted event, while 'missed' marks an event generated internally by an expired Drools timer (a missed execution). It can be used as a varbind filter to implement different notifications for 'regular' failures and missed execution failures
  3. interval: a positive integer giving the repeat interval in hours (24 for daily jobs, 1 for hourly jobs, and so on)
Note that with this setup a successful execution will also clear any missed execution alarm.
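To make the role of the reduction key concrete, here is a minimal Python sketch of the alarm behaviour described above. It is only a model, not OpenNMS configuration: the `uei.example` namespace, the `JobEvent` class, the helper functions and the node and job names are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical namespace standing in for <yournamespace>.
WARNING_UEI = "uei.example/job/recurring/Warning"
NORMAL_UEI = "uei.example/job/recurring/Normal"


@dataclass(frozen=True)
class JobEvent:
    uei: str
    node: str
    job: str       # job or backupset name
    every: str     # "every" (submitted externally) or "missed" (expired timer)
    interval: int  # repeat interval in hours


def reduction_key(event):
    # Node and job are both part of the key, so two jobs
    # running on the same host never share an alarm.
    return (event.node, event.job)


def apply_event(alarms, event):
    """Update the open alarms (a dict keyed by reduction key) in place."""
    key = reduction_key(event)
    if event.uei == WARNING_UEI:
        alarms[key] = event    # raise or refresh the warning
    elif event.uei == NORMAL_UEI:
        alarms.pop(key, None)  # a Normal clears any previous Warning
    return alarms


alarms = {}
apply_event(alarms, JobEvent(WARNING_UEI, "db01", "nightly-dump", "missed", 24))
apply_event(alarms, JobEvent(WARNING_UEI, "db01", "hourly-sync", "every", 1))
apply_event(alarms, JobEvent(NORMAL_UEI, "db01", "nightly-dump", "every", 24))
open_alarms = sorted(alarms)
```

Because the key does not include the every parameter, the Normal event for nightly-dump clears the warning that was raised by a missed execution, while the unrelated hourly-sync warning on the same host stays open.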

As for the Drools program, the first relevant part is the definition of the Execution fact. Execution carries the data needed to identify the node and job, plus the timer set to the interval value of the event.

The first two rules handle the initial and subsequent events, while the third handles the expiration of an interval. The code should be self-explanatory; ask in the comments if you need help.
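For readers who want to reason about the logic without Drools at hand, the three rules can be modelled roughly as follows in Python. This is a sketch under assumptions, not the Drools program itself: the class and method names are hypothetical, time is a plain number of hours passed in by the caller instead of a real timer, and re-arming the timer after a missed execution is my assumption rather than something stated above.

```python
from dataclasses import dataclass

# Hypothetical namespace standing in for <yournamespace>.
WARNING_UEI = "uei.example/job/recurring/Warning"


@dataclass
class Execution:
    node: str
    job: str
    interval: int    # repeat interval in hours
    deadline: float  # simulated time (in hours) at which the timer expires


class MissedExecutionDetector:
    def __init__(self):
        self.executions = {}  # (node, job) -> Execution

    def on_event(self, now, node, job, interval):
        key = (node, job)
        execution = self.executions.get(key)
        if execution is None:
            # Rule 1: first event for this node/job, create the fact and start the timer.
            self.executions[key] = Execution(node, job, interval, now + interval)
        else:
            # Rule 2: subsequent event, restart the timer from now.
            execution.interval = interval
            execution.deadline = now + interval

    def on_tick(self, now):
        # Rule 3: every expired timer yields a Warning flagged as 'missed'.
        missed = []
        for execution in self.executions.values():
            if now >= execution.deadline:
                missed.append({
                    "uei": WARNING_UEI,
                    "node": execution.node,
                    "job": execution.job,
                    "every": "missed",
                    "interval": execution.interval,
                })
                # Assumption: re-arm so the next missed interval is reported too.
                execution.deadline = now + execution.interval
        return missed


detector = MissedExecutionDetector()
detector.on_event(now=0, node="db01", job="nightly-dump", interval=24)
on_time = detector.on_tick(now=12)   # timer still running, nothing reported
detector.on_event(now=24, node="db01", job="nightly-dump", interval=24)
still_ok = detector.on_tick(now=47)  # the second run moved the deadline to 48
missed = detector.on_tick(now=48)    # no event arrived in time, one 'missed' Warning
```

In the real setup the Drools timer does the waiting, so nothing like on_tick has to be called by hand; the explicit clock here only makes the sequence easy to follow.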

