Skip to main content

OpenNMS performance: tune Jrobin RRD file strategy

One of the nice aspects of OpenNMS is that, out of the box, it will collect a lot of data from most snmp-enabled resources. The downside is that such collection is I/O heavy (iops, not throughput).

Even on moderate installations with hundreds of nodes it is enough to swamp even the fastest disk subsystem (except for those with controllers supported by large write caches). A symptom is that I/O wait will be quite high on the opennms box itself.
I/O Wait before and after switch jrobin backend from FILE to MNIO
The graph above show the I/O wait on a RAID 10 array with 4 15K drives, storing RRD data from approximately 300 nodes, sampled at the usual 5m interval.

The I/O wait is, or better was, constantly at 30% (before I applied some postgres tuning it was at 70%).

As you can see from the graph I/O wait fell sharply after 7.30 when I applied a simple change to Jrobin which is the OpenNMS subsystem responsible for writing and reading RRDs.

The change involves using an alternative I/O strategy called MNIO instead of FILE which is the default. It requires editing just a properties file. Restart is required.

The box has been running with the new setting for several days now without errors and excellent performance. On the mailing list someone reported years of running successfully with MNIO.

HTH.

Comments

Popular posts from this blog

From 0 to ZFS replication in 5m with syncoid

The ZFS filesystem has many features that once you try them you can never go back. One of the lesser known is probably the support for replicating a zfs filesystem by sending the changes over the network with zfs send/receive.
Technically the filesystem changes don't even need to be sent over a network: you could as well dump them on a removable disk, then receive  from the same removable disk.

Mirth: recover space when mirthdb grows out of control

I was recently asked to recover a mirth instance whose embedded database had grown to fill all available space so this is just a note-to-self kind of post.
Btw: the recovery, depending on db size and disk speed, is going to take long.

The problem A 1.8 Mirth Connect instance was started, then forgotten (well neglected, actually). The user also forgot to setup pruning so the messages filled the embedded Derby database until it grew to fill all the available space on the disk. The SO is linux.

The solution First of all: free some disk space so that the database can be started in embedded mode from the cli. You can also copy the whole mirth install to another server if you cannot free space. Depending on db size you will need a corresponding amount of space: in my case a 5GB db required around 2GB to start, process logs and then store the temp files during shrinking.

Then open a shell as the user that mirth runs as (you're not running it as root, are you?) and cd into the mirth home. …

Indexing Apache access logs with ELK (Elasticsearch+Logstash+Kibana)

Who said that grepping Apache logs has to be boring?

The truth is that, as Enteprise applications move to the browser too, Apache access logs are a gold mine, it does not matter what your role is: developer, support or sysadmin. If you are not mining them you are most likely missing out a ton of information and, probably, making the wrong decisions.
ELK (Elasticsearch, Logstash, Kibana) is a terrific, Open Source stack for visually analyzing Apache (or nginx) logs (but also any other timestamped data).