HQ High Availability Failover Guide
Available only in HQ Enterprise
For large deployments, Hyperic provides a special installation to maximize HQ's availability. In a high-availability installation, HQ is installed as a cluster of HQ Servers. One Server serves as the central HQ Server, and if it becomes unavailable, the torch automatically passes to another Server. A high-availability installation is transparent to the end user and Administrator.
Please note that the intention of an HA installation is to provide high availability of HQ, not scalability. For the purposes of high availability, usually only two HQ Servers are needed in a cluster.
| What Technologies Does This Use? HQ high-availability deployments use JBoss Cluster for node detection and promotion and Ehcache's distributed caching for replicating changes throughout the cluster. |
- Overview of a high-availability installation
- How to configure HQ for high availability
- Troubleshooting
Overview of a High-Availability Installation
In a high-availability installation, a cluster contains multiple nodes, and one serves as the primary node. HQ automatically chooses this node, and the choice is transparent to the user and the Administrator. All the HQ UI requests and HQ Agent communications go through a load balancer (either software or hardware), which passes them through to the primary node. Among other tasks, the primary node:
- Handles automatic database maintenance
- Calculates automatic metric baselines
- Serves the HQ GUI
The load balancer does not perform load balancing in this situation. Instead, if the primary node becomes unavailable, the load balancer provides failover to another node in the cluster.

A high-availability installation
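The failover-only role of the load balancer described above can be illustrated with a minimal sketch. It is not HQ or load balancer code; the `Node` class and `choose_target` function are hypothetical names used only for illustration, and the health flag stands in for whatever check a real load balancer performs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    name: str      # hypothetical label, e.g. "node1"
    healthy: bool  # stand-in for the balancer's own health check


def choose_target(nodes: Sequence[Node]) -> Optional[Node]:
    """Return the first healthy node in failover order, or None if all are down.

    Traffic is never spread across nodes: a later node is used only
    when every node before it is unavailable.
    """
    for node in nodes:
        if node.healthy:
            return node
    return None
```

With two healthy nodes, every request goes to the first one; the second receives traffic only after the first is marked unavailable.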
How to Configure HQ for High Availability
To configure high availability:
- Choose a database.
- Install all the nodes (HQ Servers).
- Configure the cluster.
- Configure the load balancer.
- Start the nodes.
- Verify the cluster initialization.
Step 1. Choose a database.
All nodes in the HQ cluster are required to share the same database. Hyperic recommends that HQ be configured to run against an external database when running in cluster mode. Oracle, PostgreSQL, and MySQL are supported, using the -oracle, -postgresql, or -mysql flag for the installer. The Complete Installation Guide describes these options.
| Built-in HQ Database: You can use the built-in PostgreSQL database that HQ ships with; however, extra configuration is required to allow remote connections to the database. |
See Preparing HQ's Database for more information.
Step 2. Install the nodes.
The first node that is installed creates the HQ database schema. When you install the other nodes in the cluster, the installer detects the existing HQ database and asks whether to recreate or upgrade it. For these nodes, choose the upgrade option.
See the Installation Overview for help installing an HQ Server.
Step 3. Configure the cluster.
Before starting HQ in HA mode, each node in the cluster must be configured to support the cluster. The configuration options are found in the conf/hq-server.conf file, at the end, in the "Cluster Settings" section. This is a one-time operation; the configuration will be retained during upgrades of the HQ Server. Below are listed all the available configuration options, divided into the options you must set and the ones that you may optionally set. (The file itself also indicates whether an option is required or not.)
Required Configuration Options
| Option | Description |
|---|---|
| ha.partition | This property sets the name of the cluster and must be identical on all nodes of the cluster. |
| ha.node.address | This property sets the node address for this node in the cluster. The value must be unique for each node in the cluster and can be either an IP address or a hostname. Do not use 127.0.0.1 for this setting, because other members of the cluster will then fail to detect this node. |
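The rules for the two required options can be checked mechanically before the nodes are started. The following is a minimal sketch, assuming the cluster settings in conf/hq-server.conf are simple key=value lines; the function names and the per-node dictionary are illustrative and not part of HQ.

```python
import ipaddress


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def is_loopback(address: str) -> bool:
    """True for 'localhost' or a loopback IP; other hostnames are not resolved."""
    if address.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        return False  # not an IP literal; left unchecked in this sketch


def check_cluster(configs: dict) -> list:
    """Return a list of problems found across per-node config texts."""
    problems = []
    parsed = {node: parse_properties(text) for node, text in configs.items()}
    for node, props in parsed.items():
        for key in ("ha.partition", "ha.node.address"):
            if not props.get(key):
                problems.append(f"{node}: {key} is not set")
        if is_loopback(props.get("ha.node.address", "")):
            problems.append(f"{node}: ha.node.address is a loopback address")
    # ha.partition must be identical on all nodes
    if len({p.get("ha.partition") for p in parsed.values()}) > 1:
        problems.append("ha.partition differs between nodes")
    # ha.node.address must be unique per node
    addresses = [p.get("ha.node.address") for p in parsed.values()]
    if len(set(addresses)) != len(addresses):
        problems.append("ha.node.address is not unique")
    return problems
```

Pass it a mapping from any node label to the text of that node's hq-server.conf; an empty list means the required options look consistent.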
The following table enumerates the default multicast settings. In most cases, you should not need to change the default.
Optional Configuration Options
| Option | Description | Default Value |
|---|---|---|
| ha.node.mcast_addr | The multicast address used to send messages throughout the cluster | 228.1.2.3 |
| ha.node.mcast_port | The multicast UDP port to use to broadcast cluster membership information | 45566 |
| ha.node.cacheListener.port | The multicast TCP port to use for distributed cache detection | 45567 |
| ha.node.cacheProvider.port | The multicast TCP port to use for distributed cache invalidation throughout the cluster | 45568 |
| Upgrading from a Pre-v3.0 HA Configuration: Servers upgraded from pre-3.x versions of HQ will have obsolete cluster settings, notably the server.cluster.mode and server.ha.bind_addr properties. Remove these settings before running a 3.x HQ Server in cluster mode. |
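To find leftover pre-3.x settings, a small scan of the configuration file may help. This is only a sketch: it assumes key=value lines with # comments, and it looks only for the two obsolete properties named above; `find_obsolete_settings` is a hypothetical helper name.

```python
OBSOLETE_KEYS = ("server.cluster.mode", "server.ha.bind_addr")


def find_obsolete_settings(conf_text: str) -> list:
    """Return (line_number, key) pairs for active obsolete cluster settings."""
    found = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        # Commented-out or non-property lines are ignored
        if stripped.startswith("#") or "=" not in stripped:
            continue
        key = stripped.split("=", 1)[0].strip()
        if key in OBSOLETE_KEYS:
            found.append((number, key))
    return found
```

Call it with the contents of conf/hq-server.conf and remove any lines it reports before starting the Server in cluster mode.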
Step 4. Configure the load balancer for failover.
The load balancer in an HQ high-availability installation provides failover only; it does not load-balance. Configuration depends entirely on the load balancer being used (either hardware or software), but ultimately the load balancer needs to know which nodes are in the cluster and how to fail over from one to the other.
The JBoss configuration must be performed on every node in the cluster, regardless of the type of load balancer.
| Direct all Agent and UI Communications through the Load Balancer: In order for a high-availability installation to work, all the HQ Agents and the UI must communicate through the load balancer. Use the load balancer's IP address and port for all such traffic. |
Example. Below are standard instructions for a sample configuration of an Apache HTTP Server load balancer with mod_jk (1.2.25), the Apache Tomcat connector, to be used in failover mode in an HA installation of only two nodes. With mod_jk, you need to indicate the primary node and specify the failover order. However, this isn't necessarily the case in other load balancers.
To configure an Apache load balancer with mod_jk for an HA installation:
- Download and install Apache HTTP Server. Get it here.
- Download mod_jk from here.
- Copy mod_jk to the Apache modules directory.
- Add the following properties to httpd.conf.
    # change the mod_jk library filename as appropriate below
    LoadModule jk_module modules/mod_jk-apache-2.2.4.so
    <IfModule jk_module>
        JkWorkersFile conf/worker.properties
        JkLogFile logs/mod_jk.log
        JkLogLevel info
        JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
        # forward all traffic to loadbalancer worker (see worker.properties below)
        JkMount /* loadbalancer
    </IfModule>
- Create a new file, worker.properties, and copy the following lines into it. In this file, you can see two nodes: Node 2 is specified as the preferred failover node, so it only gets traffic when node 1 is down.
    worker.list=loadbalancer

    # Define Node 1 PRIMARY
    worker.node1.port=2009
    worker.node1.host=10.2.0.139
    worker.node1.type=ajp13
    worker.node1.lbfactor=1
    # Define preferred failover node for node 1
    worker.node1.redirect=node2

    # Define Node 2 SECONDARY
    worker.node2.port=2009
    worker.node2.host=10.2.0.138
    worker.node2.type=ajp13
    worker.node2.lbfactor=1
    # Disable worker2 for all requests except failover
    worker.node2.activation=disabled

    # Load-balancing behaviour
    worker.loadbalancer.type=lb
    worker.loadbalancer.balance_workers=node1,node2
- Change all the sample values (port, host IP address, etc.) listed above to those appropriate for your HQ Servers. Note that, in this file, the IP address specified in worker.node#.host is the IP address of JBoss on that node. Change the IP address to accommodate your JBoss installation.
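After editing the sample values, the failover relationships in worker.properties are easy to get wrong. The following sketch checks the three properties that the sample relies on (redirect, activation, and balance_workers). It is an illustrative helper, not part of mod_jk or HQ; the function names are hypothetical and the embedded text repeats the sample file above.

```python
SAMPLE = """\
worker.list=loadbalancer
worker.node1.port=2009
worker.node1.host=10.2.0.139
worker.node1.type=ajp13
worker.node1.lbfactor=1
worker.node1.redirect=node2
worker.node2.port=2009
worker.node2.host=10.2.0.138
worker.node2.type=ajp13
worker.node2.lbfactor=1
worker.node2.activation=disabled
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=node1,node2
"""


def parse_workers(text: str) -> dict:
    """Group worker.<name>.<attribute>=value lines by worker name."""
    workers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.strip().split(".")
        if len(parts) == 3 and parts[0] == "worker":
            workers.setdefault(parts[1], {})[parts[2]] = value.strip()
    return workers


def check_failover(text: str, primary: str, standby: str,
                   balancer: str = "loadbalancer") -> list:
    """Return problems with the primary/standby failover setup."""
    workers = parse_workers(text)
    problems = []
    listed = workers.get(balancer, {}).get("balance_workers", "")
    members = [m.strip() for m in listed.split(",") if m.strip()]
    for name in (primary, standby):
        if name not in members:
            problems.append(f"{name} is not listed in balance_workers")
    if workers.get(primary, {}).get("redirect") != standby:
        problems.append(f"{primary} does not redirect to {standby}")
    if workers.get(standby, {}).get("activation") != "disabled":
        problems.append(f"{standby} is not disabled for normal traffic")
    return problems
```

For the sample file, `check_failover(SAMPLE, "node1", "node2")` reports no problems; run it against your own file contents after changing hosts and ports.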
To configure JBoss for an HA installation:
- Make sure the following lines are not commented out in server.xml, so that it accepts ajp13 connections.
server.xml is located in <HQ Server directory>/conf/template.
    <Connector port="9009" address="${jboss.bind.address}" emptySessionPath="true" enableLookups="false" redirectPort="7443" protocol="AJP/1.3"/>
- Again in server.xml, add jvmRoute to the Engine element:
    <Engine name="jboss.web" defaultHost="localhost" jvmRoute="node1">
Please note that the jvmRoute value must match the worker name defined for this node in worker.properties (worker.<nodename>); for example, node1 on the primary node above.
- In jboss-service.xml, change UseJK to true.
The file is located in <HQ Server directory>/hq-engine/server/default/deploy/jbossweb-tomcat55.sar/META-INF.
    <attribute name="UseJK">true</attribute>
- Repeat these steps for every node in the cluster.
Now restart Apache and all the nodes (HQ Servers).
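A quick check of each node's edited server.xml can confirm that the AJP connector is active and that jvmRoute is set. The sketch below uses Python's standard XML parser, which skips comments, so a commented-out connector is simply not found. The embedded XML is a trimmed, hypothetical skeleton built around the two elements shown above, not the real HQ file, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed skeleton; a real server.xml contains more elements.
SAMPLE_SERVER_XML = """\
<Server>
  <Service name="jboss.web">
    <Connector port="9009" address="${jboss.bind.address}"
               emptySessionPath="true" enableLookups="false"
               redirectPort="7443" protocol="AJP/1.3"/>
    <Engine name="jboss.web" defaultHost="localhost" jvmRoute="node1">
      <Host name="localhost"/>
    </Engine>
  </Service>
</Server>
"""


def ajp_connector_ports(server_xml_text: str) -> list:
    """Return the ports of all active (uncommented) AJP/1.3 connectors."""
    root = ET.fromstring(server_xml_text)
    return [c.get("port") for c in root.iter("Connector")
            if c.get("protocol") == "AJP/1.3"]


def engine_jvm_route(server_xml_text: str):
    """Return the jvmRoute of the first Engine element, or None if absent."""
    root = ET.fromstring(server_xml_text)
    engine = next(root.iter("Engine"), None)
    return None if engine is None else engine.get("jvmRoute")
```

Compare the returned jvmRoute with the node's worker name in worker.properties, and the connector port with the port configured for that worker.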
Step 5. Start the nodes.
Start the HQ Servers. For help, consult the instructions for a non-Windows environment or a Windows environment.
Step 6. Verify the cluster initialization.
After starting a cluster node, verify that the cluster initialization succeeded by looking at the server.log. Upon successful cluster initialization, you should see a message like this:
    INFO [main] [com.hyperic.hq.ha.server.session.HAStartupListener] Enabling clustered services on partition HQCluster (Node address=10.2.0.139 multicast address=228.1.2.3:55566)
    INFO [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Initializing
    INFO [DownHandler (UDP)] [org.jgroups.protocols.UDP] sockets will use interface 10.2.0.139
    INFO [DownHandler (UDP)] [org.jgroups.protocols.UDP] socket information:
    local_addr=10.2.0.139:33420 (additional data: 15 bytes), mcast_addr=228.1.2.3:55566, bind_addr=/10.2.0.139, ttl=64
    sock: bound to 10.2.0.139:33420, receive buffer size=64000, send buffer size=32000
    mcast_recv_sock: bound to 10.2.0.139:55566, send buffer size=135168, receive buffer size=80000
    mcast_send_sock: bound to 10.2.0.139:33421, send buffer size=135168, receive buffer size=80000
    INFO [UpHandler (GMS)] [STDOUT]
    -------------------------------------------------------
    GMS: address is 10.2.0.139:33420 (additional data: 15 bytes)
    -------------------------------------------------------
    INFO [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Number of cluster members: 2
    INFO [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Other members: 1
    INFO [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Fetching state (will wait for 60000 milliseconds):
    INFO [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] New cluster view for partition HQCluster: 3 ([10.2.0.138:2099, 10.2.0.139:2099] delta: 0)
    INFO [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] I am (null) received membershipChanged event:
    INFO [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] Dead members: 0 ([])
    INFO [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] New Members : 0 ([])
    INFO [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] All Members : 2 ([10.2.0.138:2099, 10.2.0.139:2099])
The above console output shows that the cluster HQCluster has started and also lists the current cluster's two members: 10.2.0.138:2099 and 10.2.0.139:2099. When you start up other nodes with the same HQCluster partition name, you should see additional entries in the membership list.
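Reading the membership list out of server.log can also be scripted. The sketch below extracts the most recent "All Members" entry; it assumes the log lines have the shape shown in the sample output above, and `last_cluster_members` is a hypothetical helper name.

```python
import re

# Matches e.g. "All Members : 2 ([10.2.0.138:2099, 10.2.0.139:2099])"
MEMBERS_RE = re.compile(r"All Members\s*:\s*(\d+)\s*\(\[(.*?)\]\)")


def last_cluster_members(log_text: str):
    """Return (count, [members]) from the latest 'All Members' line, or None."""
    matches = MEMBERS_RE.findall(log_text)
    if not matches:
        return None
    count, members = matches[-1]
    names = [m.strip() for m in members.split(",") if m.strip()]
    return int(count), names
```

If the function returns None, or fewer members than the number of nodes you started, the cluster has probably not initialized as expected.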
Troubleshooting
This section enumerates the most common sources of problems with configuring a high-availability installation.
| Source of Problem | Why It Happened or What to Do |
|---|---|
| Multicast blocking | Cluster detection and cache peer detection rely on multicast. Make sure your router isn't blocking multicast packets; otherwise the HQ cluster will fail to initialize properly. Virtualization technologies like VMware and Xen also commonly leave multicast disabled by default. |
| Don't register agents using the loopback address | If you plan to install Agents on the cluster nodes, do not use the loopback address (127.0.0.1) as the IP address that the HQ Server should use to contact the Agent when registering the Agent. Registering Agents with the loopback address could cause the Server to contact the wrong Agent. |
| Alerts that were currently firing or in escalation were "lost" | A failover to another cluster node occurred in the middle of the alerts being fired or escalated. The alert state could be lost. |
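If multicast blocking is suspected, a simple probe between two cluster hosts can help narrow it down. The sketch below uses only Python's standard socket module and is independent of HQ; the group and port defaults are the documented defaults for ha.node.mcast_addr and ha.node.mcast_port, the function names are hypothetical, and the probe is best run while the HQ Server on that host is stopped so that the port is free.

```python
import ipaddress
import socket
import struct

MCAST_GROUP = "228.1.2.3"  # documented default for ha.node.mcast_addr
MCAST_PORT = 45566         # documented default for ha.node.mcast_port


def is_multicast(address: str) -> bool:
    """True if the address lies in the IP multicast range."""
    return ipaddress.ip_address(address).is_multicast


def membership_request(group: str) -> bytes:
    """Build the ip_mreq structure used to join a group on any interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))


def send_probe(group: str = MCAST_GROUP, port: int = MCAST_PORT,
               message: bytes = b"hq-mcast-probe", ttl: int = 1) -> None:
    """Send one UDP datagram to the multicast group (ttl=1 stays on the subnet)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(message, (group, port))


def wait_for_probe(group: str = MCAST_GROUP, port: int = MCAST_PORT,
                   timeout: float = 10.0):
    """Join the group and return (data, sender), or None if nothing arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        try:
            return sock.recvfrom(1024)
        except socket.timeout:
            return None
```

Run `wait_for_probe()` on one host and `send_probe()` on the other within the timeout. If nothing arrives, multicast is likely being filtered somewhere between the two hosts; raise `ttl` if the hosts are separated by a router.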