OpenNMS: PostgreSQL 9.1 tuning

I have just completed an upgrade from OpenNMS 1.8.11 to the latest and greatest 1.10. The upgrade itself is easy, and the guides on the OpenNMS wiki will serve you well. In this post I'll instead describe a couple of other changes I made that greatly improved the overall performance and responsiveness of the system.

One is the upgrade from PostgreSQL 8.4 (which came with CentOS) to 9.1, plus some tuning.
The other is switching from Apache to nginx.

Upgrading postgres is mostly a matter of taking a backup, pulling in the right repo, running yum install and finally importing the database.


Tuning postgres
I left OpenNMS running on PostgreSQL 9.1 for a while and then checked how well postgres was doing. Postgres 9 already performs significantly better than its 8.x predecessors, but I wanted to do better than out-of-the-box.

As the postgres user I logged into the opennms database to install a utility that helps estimate how much of the database is being cached. If the server had enough memory, I could then configure postgres to use more of it for caching, which means less disk I/O and better overall performance (unless your server starts thrashing, that is).

CREATE EXTENSION pg_buffercache;
CREATE VIEW v_database_cache AS
SELECT c.relname,
       pg_size_pretty(count(*) * 8192) AS buffered,
       round(100.0 * count(*) / (SELECT setting FROM pg_settings WHERE name = 'shared_buffers')::integer, 1) AS buffers_percent,
       round(100.0 * count(*) * 8192 / pg_relation_size(c.oid), 1) AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.oid, c.relname
ORDER BY 3 DESC
LIMIT 10;

The second command wraps the complex query in a view so that it is easy to run again later. The query is documented in the (highly recommended) book PostgreSQL 9.0 High Performance.
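To make the view's columns concrete, here is the arithmetic spelled out with illustrative numbers. This is only a sketch: it assumes the default 8 kB block size (which is where the 8192 in the view comes from), and both the 34048 cached buffers and the 65536-buffer cache are made-up inputs, not values read from a server.

```sql
-- Illustrative arithmetic only: 34048 and 65536 are assumed values.
-- pg_settings reports shared_buffers as a count of 8 kB buffers,
-- so a 512MB cache corresponds to 512 * 1024 / 8 = 65536 buffers.
SELECT pg_size_pretty(34048::bigint * 8192) AS buffered,         -- cached size of the relation
       round(100.0 * 34048 / 65536, 1)      AS buffers_percent;  -- share of shared_buffers
-- returns: 266 MB | 52.0
```

The percent_of_relation column applies the same idea, dividing the cached bytes by pg_relation_size() of the relation instead of by the size of the cache.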
I then queried the database for its size as follows:

opennms=# select pg_size_pretty(pg_database_size('opennms')) as db_size;
 db_size 
---------
 691 MB
(1 row)

Since the whole database is a little less than 700MB and the server has 4GB of RAM, I decided to raise the shared_buffers parameter from the measly 32MB default to something more appropriate, like 512MB. After that I restarted both OpenNMS and postgres and waited a while to make sure that the server wasn't swapping. Two days later I queried the view I created above to see how much of the database was in the cache (which means in RAM):

opennms=# select * from v_database_cache;
            relname            | buffered | buffers_percent | percent_of_relation 
-------------------------------+----------+-----------------+---------------------
 events                        | 266 MB   |            52.0 |                64.5
 event_archives                | 22 MB    |             4.3 |               100.0
 notifications                 | 20 MB    |             3.9 |               100.1
 events_nodeid_display_ackuser | 5616 kB  |             1.1 |                33.2
 outages                       | 5000 kB  |             1.0 |               100.3
 events_ipaddr_idx             | 4496 kB  |             0.9 |                27.5
 events_nodeid_idx             | 4456 kB  |             0.8 |                36.4
 events_uei_idx                | 3648 kB  |             0.7 |                10.5
 iprouteinterface              | 2096 kB  |             0.4 |               101.6
 events_time_idx               | 2256 kB  |             0.4 |                20.0
(10 rows)

Looks good: the events table (the largest by far) and its indexes are mostly cached in RAM!
I could try raising shared_buffers to a value larger than the database size, but since more than half of the most expensive relation is already cached, and postgres uses plenty of other memory besides shared_buffers, I left it as it is.
Performance and responsiveness of the UI improved a lot. I/O wait went down too, by 2-4%.
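One way to sanity-check the decision to stop at 512MB is to look at how much of the cache is still empty. The queries below are a minimal sketch, assuming the pg_buffercache extension installed earlier and a superuser session; the column aliases are my own.

```sql
-- Confirm the value the running server is actually using.
SHOW shared_buffers;

-- Count buffers that have never been filled. In pg_buffercache,
-- unused buffers show NULL in relfilenode.
SELECT count(*) AS total_buffers,
       sum(CASE WHEN relfilenode IS NULL THEN 1 ELSE 0 END) AS unused_buffers
FROM pg_buffercache;
```

As a rule of thumb (not something I measured here), if unused_buffers is still well above zero after the server has been running for a few days, the cache is not full and a larger shared_buffers is unlikely to buy much.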

Swapping NGINX in for Apache
I did it mostly to save CPU and RAM, because the amount of resources that even a lightly loaded Apache can consume is astounding. Don't take my word for it: google it!
This one's easy too: head to the download page and grab the rpm for your distro.

The net result of the above changes is that the UI now loads faster and page transitions are smooth, without excessive waiting, even on the node detail page for a system with lots of events.

Hope this helps!