HQ High Availability Failover Guide

Available only in HQ Enterprise

For large deployments, Hyperic provides a special installation to maximize HQ's availability. In a high-availability installation, HQ is installed as a cluster of HQ Servers. One Server serves as the central HQ Server, and if it becomes unavailable, another Server in the cluster automatically takes over. A high-availability installation is transparent to the end user and Administrator.

Please note that the intention of an HA installation is to provide high availability of HQ, not scalability. For the purposes of high availability, usually only two HQ Servers are needed in a cluster.

What Technologies Does This Use?
HQ high-availability deployments use JBoss Cluster for node detection and promotion and Ehcache's distributed caching for replicating changes throughout the cluster.


Overview of a High-Availability Installation

In a high-availability installation, a cluster contains multiple nodes, and one serves as the primary node. HQ automatically chooses this node, and the choice is transparent to the user and the Administrator. All the HQ UI requests and HQ Agent communications go through a load balancer (either software or hardware), which passes them through to the primary node. Among other tasks, the primary node:

The load balancer does not perform load balancing in this situation. Instead, if the primary node becomes unavailable, the load balancer provides failover to another node in the cluster.


A high-availability installation


How to Configure HQ for High Availability

To configure high availability:

  1. Choose a database.
  2. Install all the nodes (HQ Servers).
  3. Configure the cluster.
  4. Configure the load balancer.
  5. Start the nodes.
  6. Verify the cluster initialization.

Step 1. Choose a database.

All nodes in the HQ cluster are required to share the same database. Hyperic recommends that HQ be configured to run against an external database when running in cluster mode. Oracle, PostgreSQL, and MySQL are supported, using the -oracle, -postgresql, or -mysql flag for the installer. The Complete Installation Guide describes these options.

Built-in HQ Database

You can use the built-in PostgreSQL database that ships with HQ. However, extra configuration is required to allow remote connections to the database.

See Preparing HQ's Database for more information.

Step 2. Install the nodes.

The first node that is installed creates the HQ database schema. When you install the other nodes in the cluster, the installer detects the existing HQ database and asks whether to recreate or upgrade it. For these nodes, choose the upgrade option.

See the Installation Overview for help installing an HQ Server.

Step 3. Configure the cluster.

Before starting HQ in HA mode, each node in the cluster must be configured to support the cluster. The configuration options are found in the conf/hq-server.conf file, at the end, in the "Cluster Settings" section. This is a one-time operation; the configuration will be retained during upgrades of the HQ Server. Below are listed all the available configuration options, divided into the options you must set and the ones that you may optionally set. (The file itself also indicates whether an option is required or not.)

Required Configuration Options

| Option | Description |
| --- | --- |
| ha.partition | Sets the name of the cluster. It must be identical on all nodes of the cluster. |
| ha.node.address | Sets the address of this node in the cluster. The value must be unique for each node and can be either an IP address or a hostname. Do not use 127.0.0.1 for this setting; doing so prevents other members of the cluster from detecting this node properly. |
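The two rules above (one shared partition name, and a unique, non-loopback address per node) can be checked mechanically before the cluster is started. The following Python sketch is illustrative only and is not part of HQ: the function names and sample values are assumptions, and it treats hq-server.conf as plain key=value lines.

```python
import ipaddress


def parse_conf(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def is_loopback(address):
    """True for loopback IPs such as 127.0.0.1 and for the name localhost."""
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        # Not an IP literal, so treat it as a hostname.
        return address.lower() == "localhost"


def check_cluster(node_confs):
    """Return a list of problems found across the per-node settings."""
    problems = []
    partitions = {conf.get("ha.partition") for conf in node_confs}
    if None in partitions:
        problems.append("ha.partition is missing on at least one node")
    if len(partitions) > 1:
        problems.append("ha.partition differs between nodes")
    addresses = [conf.get("ha.node.address") for conf in node_confs]
    for address in addresses:
        if address is None:
            problems.append("ha.node.address is missing on a node")
        elif is_loopback(address):
            problems.append("ha.node.address must not be loopback: " + address)
    known = [a for a in addresses if a is not None]
    if len(set(known)) != len(known):
        problems.append("ha.node.address is not unique across nodes")
    return problems


# Hypothetical settings for a two-node cluster.
node1 = parse_conf("ha.partition=HQCluster\nha.node.address=10.2.0.139\n")
node2 = parse_conf("ha.partition=HQCluster\nha.node.address=10.2.0.138\n")
print(check_cluster([node1, node2]))
```

An empty list means that none of the checked rules is violated. It does not prove that the nodes can actually reach each other.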

The following table enumerates the default multicast settings. In most cases, you should not need to change the default.

Optional Configuration Options

| Option | Description | Default Value |
| --- | --- | --- |
| ha.node.mcast_addr | The multicast address used to send messages throughout the cluster | 228.1.2.3 |
| ha.node.mcast_port | The multicast UDP port to use to broadcast cluster membership information | 45566 |
| ha.node.cacheListener.port | The multicast TCP port to use for distributed cache detection | 45567 |
| ha.node.cacheProvider.port | The multicast TCP port to use for distributed cache invalidation throughout the cluster | 45568 |

Upgrading from a Pre-v3.0 HA Configuration

Servers upgraded from pre-3.x versions of HQ will have obsolete cluster settings, notably the server.cluster.mode and server.ha.bind_addr properties. Remove these settings before running a 3.x HQ Server in cluster mode.
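As a rough aid for this cleanup, the Python sketch below scans the text of a configuration file for the two obsolete properties named above and reports the lines where they are still active. It is not an HQ tool; the function name and the sample values (including server.cluster.mode=true) are hypothetical.

```python
OBSOLETE_KEYS = ("server.cluster.mode", "server.ha.bind_addr")


def find_obsolete_settings(conf_text):
    """Return (line_number, key) pairs for obsolete keys that are not commented out."""
    found = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        key = stripped.split("=", 1)[0].strip()
        if key in OBSOLETE_KEYS:
            found.append((number, key))
    return found


# Hypothetical excerpt from an upgraded conf/hq-server.conf.
sample = """\
# Cluster Settings
server.cluster.mode=true
ha.partition=HQCluster
#server.ha.bind_addr=10.2.0.139
"""
for number, key in find_obsolete_settings(sample):
    print(f"line {number}: remove {key}")
```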

Step 4. Configure the load balancer for failover.

The load balancer in an HQ high-availability installation provides failover only; it does not load-balance. The configuration details depend on the load balancer (hardware or software) being used, but in every case the load balancer needs to know which nodes are in the cluster and how to fail over from one to the other.

The JBoss configuration must be performed on every node in the cluster, regardless of the type of load balancer.

Direct all Agent and UI Communications through the Load Balancer

In order for a high-availability installation to work, all the HQ Agents and the UI must communicate through the load balancer. Use the load balancer's IP address and port for all such traffic.

Example. Below are standard instructions for a sample configuration of an Apache HTTP Server load balancer, with mod_jk (1.2.25), the Apache Tomcat connector, to be used in failover mode in an HA installation of only two nodes. With mod_jk, you need to indicate the primary node and specify the failover order. However, this isn't necessarily the case with other load balancers.

To configure an Apache load balancer with mod_jk for an HA installation:

  1. Download and install Apache HTTP Server.
  2. Download mod_jk.
  3. Copy mod_jk to the Apache modules directory.
  4. Add the following properties to httpd.conf.
    # change the mod_jk library filename as appropriate below
    LoadModule jk_module modules/mod_jk-apache-2.2.4.so
    
    <IfModule jk_module>
        JkWorkersFile conf/worker.properties
        JkLogFile logs/mod_jk.log
        JkLogLevel info
        JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
        # forward all traffic to loadbalancer worker (see worker.properties below)
        JkMount /* loadbalancer
    </IfModule>
    
  5. Create a new file named worker.properties at the path given by JkWorkersFile above (conf/worker.properties) and copy the following lines into it. The file defines two nodes: node 2 is specified as the preferred failover node, so it only gets traffic when node 1 is down.
    worker.list=loadbalancer
    
    # Define Node 1 PRIMARY
    worker.node1.port=2009
    worker.node1.host=10.2.0.139
    worker.node1.type=ajp13
    worker.node1.lbfactor=1
    # Define preferred failover node for node 1
    worker.node1.redirect=node2
    
    # Define Node 2 SECONDARY
    worker.node2.port=2009
    worker.node2.host=10.2.0.138
    worker.node2.type=ajp13
    worker.node2.lbfactor=1
    # Disable node2 for all requests except failover
    worker.node2.activation=disabled
    
    # Load-balancing behaviour
    worker.loadbalancer.type=lb
    worker.loadbalancer.balance_workers=node1,node2
    
  6. Change all the sample values (port, host IP address, etc.) listed above to those appropriate for your HQ Servers. In this file, worker.node#.host is the IP address that JBoss listens on for that node, and worker.node#.port must match the port of the AJP connector configured in that node's server.xml.
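To double-check the failover order after editing the file, a small script can read worker.properties and report which workers are active and which are held in standby. The Python sketch below is not part of mod_jk or HQ. It interprets only the property names used in the sample file above, assumes worker.list names a single balancer, and uses hypothetical function names.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def summarize_workers(props):
    """Return (active, standby) worker names for the balancer in worker.list."""
    balancer = props["worker.list"].strip()
    raw_members = props[f"worker.{balancer}.balance_workers"]
    members = [name.strip() for name in raw_members.split(",")]
    active, standby = [], []
    for name in members:
        # Every balanced worker needs a host, a port, and a type.
        for field in ("host", "port", "type"):
            if f"worker.{name}.{field}" not in props:
                raise ValueError(f"worker.{name}.{field} is not defined")
        if props.get(f"worker.{name}.activation") == "disabled":
            standby.append(name)
        else:
            active.append(name)
    # A redirect target must itself be one of the balanced workers.
    for name in active:
        target = props.get(f"worker.{name}.redirect")
        if target is not None and target not in members:
            raise ValueError(f"worker.{name}.redirect names unknown worker {target}")
    return active, standby


SAMPLE = """\
worker.list=loadbalancer
worker.node1.port=2009
worker.node1.host=10.2.0.139
worker.node1.type=ajp13
worker.node1.lbfactor=1
worker.node1.redirect=node2
worker.node2.port=2009
worker.node2.host=10.2.0.138
worker.node2.type=ajp13
worker.node2.lbfactor=1
worker.node2.activation=disabled
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=node1,node2
"""
print(summarize_workers(parse_properties(SAMPLE)))
```

For the sample file, the expected result is node1 as the only active worker and node2 as the standby.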

To configure JBoss for an HA installation:

  1. In server.xml, located in <HQ Server directory>/conf/template, make sure the following lines are not commented out, so that JBoss accepts ajp13 connections.
    <Connector port="9009" address="${jboss.bind.address}"
       emptySessionPath="true" enableLookups="false" redirectPort="7443" 
       protocol="AJP/1.3"/>
    
  2. Again in server.xml, add jvmRoute to the Engine element:
    <Engine name="jboss.web" defaultHost="localhost" jvmRoute="node1">
    

    Please note that the jvmRoute value must match the worker name given to this node in worker.properties (the <nodename> part of worker.<nodename>), for example node1 on the primary node.

  3. In jboss-service.xml, change UseJK to true.
    The file is located in <HQ Server directory>/hq-engine/server/default/deploy/jbossweb-tomcat55.sar/META-INF.
    <attribute name="UseJK">true</attribute>
    
  4. Repeat these steps for every node in the cluster.

Now restart Apache and all the nodes (HQ Servers).

Step 5. Start the nodes.

Start each HQ Server. For help, consult the startup instructions for a non-Windows environment or for a Windows environment.

Step 6. Verify the cluster initialization.

After starting a cluster node, verify that the cluster initialization succeeded by looking at the server.log. Upon successful cluster initialization, you should see a message like this:

"server.log"
INFO  [main] [com.hyperic.hq.ha.server.session.HAStartupListener] Enabling clustered services on partition HQCluster (Node address=10.2.0.139 multicast address=228.1.2.3:55566)
INFO  [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Initializing
INFO  [DownHandler (UDP)] [org.jgroups.protocols.UDP] sockets will use interface 10.2.0.139
INFO  [DownHandler (UDP)] [org.jgroups.protocols.UDP] socket information:
local_addr=10.2.0.139:33420 (additional data: 15 bytes), mcast_addr=228.1.2.3:55566, bind_addr=/10.2.0.139, ttl=64
sock: bound to 10.2.0.139:33420, receive buffer size=64000, send buffer size=32000
mcast_recv_sock: bound to 10.2.0.139:55566, send buffer size=135168, receive buffer size=80000
mcast_send_sock: bound to 10.2.0.139:33421, send buffer size=135168, receive buffer size=80000
INFO  [UpHandler (GMS)] [STDOUT]
-------------------------------------------------------
GMS: address is 10.2.0.139:33420 (additional data: 15 bytes)
-------------------------------------------------------
INFO  [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Number of cluster members: 2
INFO  [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Other members: 1
INFO  [main] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] Fetching state (will wait for 60000 milliseconds):
INFO  [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.interfaces.HAPartition.HQCluster] New cluster view for partition HQCluster: 3 ([10.2.0.138:2099, 10.2.0.139:2099] delta: 0)
INFO  [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] I am (null) received membershipChanged event:
INFO  [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] Dead members: 0 ([])
INFO  [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] New Members : 0 ([])
INFO  [UpHandler (STATE_TRANSFER)] [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] All Members : 2 ([10.2.0.138:2099, 10.2.0.139:2099])

The above log excerpt shows that the cluster HQCluster has started and lists the cluster's two current members: 10.2.0.138:2099 and 10.2.0.139:2099. When you start up other nodes with the same HQCluster partition name, you should see additional entries in the membership list.
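The same check can be scripted. The Python sketch below extracts the member list from the last "All Members" line in a piece of log text; the regular expression is based solely on the sample excerpt above, so it may need adjusting if your log format differs. The function name is hypothetical.

```python
import re

# Matches e.g. "All Members : 2 ([10.2.0.138:2099, 10.2.0.139:2099])"
MEMBERS_RE = re.compile(r"All Members : (\d+) \(\[(.*?)\]\)")


def last_membership(log_text):
    """Return the member list from the last 'All Members' line, or None."""
    matches = MEMBERS_RE.findall(log_text)
    if not matches:
        return None
    count, members = matches[-1]
    names = [m.strip() for m in members.split(",") if m.strip()]
    if len(names) != int(count):
        raise ValueError("member count does not match the member list")
    return names


line = (
    "INFO  [UpHandler (STATE_TRANSFER)] "
    "[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.HQCluster] "
    "All Members : 2 ([10.2.0.138:2099, 10.2.0.139:2099])"
)
print(last_membership(line))
```

In practice you would pass in the contents of server.log and compare the returned list with the nodes you expect to see in the cluster.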


Troubleshooting

This section enumerates the most common sources of problems with configuring a high-availability installation.

| Source of Problem | Why It Happened or What to Do |
| --- | --- |
| Multicast blocking | Cluster detection and cache peer detection rely on multicast. Make sure your router isn't blocking multicast packets; otherwise the HQ cluster will fail to initialize properly. Virtualization technologies such as VMware and Xen also commonly leave multicast disabled by default. |
| Don't register agents using the loopback address | If you plan to install Agents on the cluster nodes, do not give the loopback address (127.0.0.1) as the IP address that the HQ Server should use to contact the Agent when registering the Agent. Registering Agents with the loopback address could cause the HQ Server to contact the wrong Agent. |
| Alerts that were firing or in escalation were "lost" | A failover to another cluster node occurred while the alerts were being fired or escalated. In that case the alert state can be lost. |
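One rough way to test whether multicast traffic passes between two hosts is to send a probe datagram from one and listen for it on the other. The Python sketch below uses only the standard socket module with the default multicast address and port from the optional settings table. It is not an HQ utility, and the function names and probe payload are made up for illustration.

```python
import socket

GROUP = "228.1.2.3"  # default ha.node.mcast_addr
PORT = 45566         # default ha.node.mcast_port


def membership_request(group):
    """Build the 8-byte ip_mreq value: group address plus any local interface."""
    return socket.inet_aton(group) + socket.inet_aton("0.0.0.0")


def receive_probe(group=GROUP, port=PORT, timeout=10.0):
    """Join the group and wait for one datagram; return (data, sender) or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        try:
            return sock.recvfrom(1024)
        except socket.timeout:
            return None
    finally:
        sock.close()


def send_probe(group=GROUP, port=PORT, ttl=2, payload=b"hq-mcast-probe"):
    """Send one datagram to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

Call receive_probe() on one node, then send_probe() on the other within the timeout. A return value of None suggests that multicast packets are not getting through. If an HQ Server is already running on the same group and port, the receiver may pick up cluster traffic instead of the probe, so it is simpler to run the test while the servers are stopped.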

