Defining Alerts
After your resources are in the HQ inventory and metrics are being collected, the first step in using the alerting functionality is defining the alerts.
First, determine which resource the alert will be defined for and navigate to the Current Health page for that resource. Next, click on the
tab. This will bring you to the List Alert Definitions page. To go to the List Alerts page, click on the
button. To return to the alert definitions list page, click on the
button. Lastly, click the
button to go to the New Alert Definition page.
An alternative route to this page is to open any metric chart and click the
link in the upper right corner of the page. This has an added benefit of pre-populating the metric drop-down with the particular metric that was being charted.
The first thing to do on the New Alert Definition page is to name the alert. Give it a name that will clearly tell you what the problem is. The name of the alert will be displayed in the subject header of an alert notification email like this:
Subject: [HQ] !! resource-name alert-name
For example, if your resource is a MySQL 3.x server residing on a Linux platform named grimlock.hyperic.net, your resource name is probably "grimlock.hyperic.net MySQL 3.x". If you want an alert for when the MySQL server goes down, you might name it simply "down!" and give it a high priority. The subject header of your alert notification mail would then look like:
Subject: [HQ] !!! grimlock.hyperic.net MySQL 3.x down!
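The subject layout above can be sketched in a few lines of Python. This is only an illustration of the format, not HQ code: the function name alert_subject and the priority_marks parameter are hypothetical, and the idea that the number of exclamation marks tracks the alert priority is an assumption drawn from the two examples shown here.

```python
def alert_subject(resource_name: str, alert_name: str, priority_marks: int = 2) -> str:
    """Build a subject in the '[HQ] !! resource-name alert-name' layout.

    priority_marks is a hypothetical parameter: it assumes the number of
    exclamation marks reflects the alert priority.
    """
    if priority_marks < 1:
        raise ValueError("priority_marks must be at least 1")
    return f"[HQ] {'!' * priority_marks} {resource_name} {alert_name}"


if __name__ == "__main__":
    # Reproduces the high-priority example from the text.
    print("Subject:", alert_subject("grimlock.hyperic.net MySQL 3.x", "down!", 3))
```

A short alert name such as "down!" keeps the subject readable, because the resource name already carries the host and server type.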
Since you already know which resource the alert is being defined for, the next step is to determine why the alert will fire. Alerts can be defined for metric thresholds, inventory property changes, or log/config tracking events. Additionally, if the resource supports control actions, an alert can be defined for control actions executed on the resource.
- To set an alert on a metric threshold, select the metric from the Metric drop-down box, select the radio button for absolute value or baseline, and select the operator (>, < or =) from the operator drop-down. For absolute value thresholds, enter the threshold in the Absolute Value text field.
- For a value change alert, select the value change radio button. The alert will fire whenever the value of the metric changes.
- For baseline thresholds, select one of the values (Baseline, Min, Max) from the baseline drop-down and enter the percentage in the textbox. *
- To alert on an inventory property value change, simply select the desired inventory property. *
- If the resource supports control actions, you can alert on a control action and its resulting status. First select the control action, and then select one of the following states: In Progress, Completed, or Failed. *
- If the resource supports event/log tracking, you can alert on the log level and, optionally, a substring to match in the log string. *
If the alert will be based on multiple conditions, click the Add Another Condition link, select the appropriate AND or OR operator from the drop-down, and add the new condition. Repeat for as many conditions as the alert needs. *
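To make the idea of chained conditions concrete, here is a minimal Python sketch of absolute value thresholds joined with AND or OR. It is not how HQ is implemented: the Condition class, the conditions_met function, and the left-to-right evaluation order are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass

# The three operators offered in the operator drop-down.
OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass
class Condition:
    metric: str
    op: str
    threshold: float
    joiner: str = "AND"  # how this condition joins the previous ones; ignored for the first


def conditions_met(conditions, values):
    """Evaluate conditions left to right (an assumed order) against current metric values."""
    result = None
    for cond in conditions:
        hit = OPERATORS[cond.op](values[cond.metric], cond.threshold)
        if result is None:
            result = hit
        elif cond.joiner == "AND":
            result = result and hit
        elif cond.joiner == "OR":
            result = result or hit
        else:
            raise ValueError(f"unknown joiner: {cond.joiner}")
    return bool(result)


if __name__ == "__main__":
    conds = [
        Condition("Free Memory", "<", 10.0),
        Condition("Active Threads", ">", 200, joiner="OR"),
    ]
    print(conditions_met(conds, {"Free Memory": 50.0, "Active Threads": 250}))
```

With OR, either condition alone is enough to satisfy the definition; with AND, every joined condition must hold at the same time.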
Next is the Recovery Alert drop-down, which is used when defining a Recovery Alert. You must already have at least one alert defined for the resource before you can define a Recovery Alert. Recovery Alerts are documented in detail in the Defining Recovery Alerts section. *
Next, decide when and how often the alert will fire once the defined conditions are met.
Each time conditions are exceeded or met
This is the selection that will apply to most alerts. It means "alert me immediately when the conditions for my alert have been met". However, with no further configuration it also means "continue alerting me every x minutes until the conditions are no longer met". The value of x depends on the collection interval of the metric(s) in the alert definition, but it is generally 1, 5 or 10 minutes. This selection is very effective when combined with a Recovery Alert, which eliminates the 'alert storm' of repeated notifications just described.
When conditions are exceeded for X within a time period of Y *
This is a fairly complex action and is most effective when the time periods represented by X and Y are relatively large. For instance, if you want to make Y anything less than 30 minutes, you probably want to use the "Each time conditions are exceeded or met" selection.
To really understand this configuration selection, you need to be familiar with the concept of metric collection intervals and know the collection interval(s) of the metric(s) in your alert definition. Let's say your alert definition is for Free Memory < 10M and you want to be alerted whenever this condition is met for 20 minutes within a time period of 1 hour. It's not absolutely necessary to know that the collection interval for the Free Memory metric is 5 minutes, but it helps: you then know that there are 12 collections per hour for that metric, and that the alert will fire if 4 of those 12 collections (4 x 5 minutes = 20 minutes) meet or exceed the threshold.
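The arithmetic in this example can be checked with a small Python sketch. The function names are hypothetical, and rounding a partial collection up to a whole one is an assumption. The sketch only restates the calculation described above.

```python
import math


def collections_needed(duration_minutes: int, interval_minutes: int) -> int:
    """Collections that must meet the threshold to cover X minutes (rounded up, an assumption)."""
    return math.ceil(duration_minutes / interval_minutes)


def collections_in_window(window_minutes: int, interval_minutes: int) -> int:
    """Whole collections taken during the time period Y."""
    return window_minutes // interval_minutes


if __name__ == "__main__":
    needed = collections_needed(20, 5)     # X = 20 minutes, 5 minute interval
    total = collections_in_window(60, 5)   # Y = 1 hour
    print(f"alert fires when {needed} of {total} collections meet the threshold")
```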
Once every X times conditions are exceeded within a time period of Y *
This selection is very similar to the previous one. With the previous option, it was helpful to know the collection intervals for the metrics in the alert definition. With this option, it is essential, because it is possible to create an alert that can never fire. If the metric collection interval is so large that X collections will never be taken within the Y time period, the alert can never fire. For example, if we take our Free Memory metric with its 5 minute collection interval and configure an alert definition to fire "Once every 15 times conditions are exceeded within a time period of 1 hour", the alert will never fire: only 12 collections are taken per hour, so we will never see 15 collections in an hour, much less 15 that exceed the alert threshold.
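A quick sanity check for this option can be written as a short Python sketch. The function names are hypothetical and the check is not part of HQ. It simply compares X with the number of collections that fit into Y.

```python
def max_collections(window_minutes: int, interval_minutes: int) -> int:
    """Whole collections that can be taken during the time period Y."""
    return window_minutes // interval_minutes


def can_ever_fire(x_times: int, window_minutes: int, interval_minutes: int) -> bool:
    """True if X exceeded collections can fit inside the time period Y at all."""
    return x_times <= max_collections(window_minutes, interval_minutes)


if __name__ == "__main__":
    # Free Memory is collected every 5 minutes, so an hour holds 12 collections.
    print(can_ever_fire(15, 60, 5))  # the impossible definition from the text
    print(can_ever_fire(12, 60, 5))  # the largest X that can still fire
```

Running the same check before saving a definition would catch values of X that are too large for the chosen time period.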
Filtering Alerts *
The last section of the New Alert Definition page is the Enable Action Filters section. This section is entirely optional, but allows you to fine tune the alerting functionality HQ offers.
Disable alert until re-enabled manually or by recovery alert
This is the configuration option mentioned above that prevents 'alert storms'. Selecting it disables the alert after it fires, so that it does not fire repeatedly. The alert becomes active again when it is re-enabled manually within HQ or when a Recovery Alert re-enables it automatically. Recovery Alerts are covered in greater detail in the Defining Recovery Alerts section.
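The behaviour of this option can be pictured as a small state machine. The Python sketch below is an illustration under assumptions, not HQ's implementation: the AlertDefinition class and its method names are hypothetical.

```python
class AlertDefinition:
    """Hypothetical model of an alert that can disable itself after firing."""

    def __init__(self, name: str, disable_after_firing: bool = False):
        self.name = name
        self.disable_after_firing = disable_after_firing
        self.enabled = True

    def evaluate(self, conditions_met: bool) -> bool:
        """Return True if the alert fires for this metric collection."""
        if not (self.enabled and conditions_met):
            return False
        if self.disable_after_firing:
            # Stay silent until re-enabled manually or by a Recovery Alert.
            self.enabled = False
        return True

    def re_enable(self) -> None:
        """Called by a user or by a Recovery Alert."""
        self.enabled = True


if __name__ == "__main__":
    alert = AlertDefinition("down!", disable_after_firing=True)
    print([alert.evaluate(True) for _ in range(3)])  # fires once, then stays quiet
    alert.re_enable()
    print(alert.evaluate(True))
```

Without the option, the same three collections would produce three alerts, which is the 'alert storm' this setting is meant to avoid.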
Disregard control actions that are defined for related alerts.
This option will only appear on New Alert Definition pages for resources that support control actions. This option only applies when these conditions are met:
- The current alert definition will include an alert action
- The resource associated with the alert is a member of an application
- There are other members of the same application with alerts that fire control actions (ideally the same control action)
If the conditions are met, this configuration option ensures that when multiple alerts fire within a short period from resources that are members of the same application, only one control action is executed. This prevents a server from being restarted several times in a short period for the same alert conditions. For example, suppose you have one alert that restarts a Tomcat server if the JVM Free Memory gets too low and another alert that restarts the same server if the JVM Active Thread count gets too high. If both alerts fire at the same time and both filter control actions, only one restart control action is executed rather than two.
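The effect of this filter can be sketched in Python as follows. Everything here is an assumption made for illustration: the ControlActionFilter class, the 60 second window (the text only says "a short period"), and the choice to match on application, target resource and action together.

```python
class ControlActionFilter:
    """Hypothetical filter that suppresses repeated control actions."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds  # assumed length of the "short period"
        self._last_run = {}

    def should_execute(self, application: str, target: str, action: str, now: float) -> bool:
        """Return True only for the first matching action inside the window."""
        key = (application, target, action)
        last = self._last_run.get(key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_run[key] = now
        return True


if __name__ == "__main__":
    flt = ControlActionFilter(window_seconds=60.0)
    # Low Free Memory and high Active Thread count alerts fire almost together.
    print(flt.should_execute("shop", "tomcat-1", "restart", now=0.0))
    print(flt.should_execute("shop", "tomcat-1", "restart", now=5.0))
```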
Filter notification actions that are defined for related alerts.
This option only has an effect on resources that are members of an application. It is very useful for cutting down on the number of alert notifications from a single application. For example, suppose an application contains several resources, each with several alerts. If alerts start firing from several of those resources, and each alert is configured to filter notifications, you receive a single email that consolidates all the alerts from all the resources.
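The consolidation described here amounts to grouping fired alerts by application. The Python sketch below shows that grouping only. The consolidate function and the tuple layout are hypothetical, and the real email format is not specified in this section.

```python
from collections import defaultdict


def consolidate(fired_alerts):
    """Group (application, resource, alert_name) tuples into one message body per application."""
    grouped = defaultdict(list)
    for application, resource, alert_name in fired_alerts:
        grouped[application].append(f"{resource}: {alert_name}")
    return {app: "\n".join(lines) for app, lines in grouped.items()}


if __name__ == "__main__":
    fired = [
        ("shop", "tomcat-1", "low memory"),
        ("shop", "mysql-1", "down!"),
    ]
    for app, body in consolidate(fired).items():
        print(f"One notification for application {app}:")
        print(body)
```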
Completing the definition process
Clicking OK will create the alert, enable it and take you to the View Alert Definition page. From this page, you can see the details of the alert definition you just created as well as configure notification. Control actions can also be configured from this page if the resource supports control.
Notification can be set by HQ Role, HQ User, arbitrary email address, or SNMP traps. The Roles tab is selected by default. To notify by role simply click
select the role(s) to be notified and click OK. To notify by HQ User, click the Notify HQ Users tab, click
and select the user(s) to be notified. To notify by arbitrary email address, click the Notify Other Recipients tab, click
and enter the email address(es) to be notified. If SNMP traps are enabled on HQ, there will be an SNMP Trap tab. To have an SNMP trap sent when the alert fires, click on the SNMP Trap tab, enter the trap destination and the OID and click the SET button. See the Enabling SNMP Traps section for information on how to enable this tab if it is not currently enabled.
To have HQ execute a control action when the alert is fired, click
in the Control Action section of the page. Use the drop-downs to select the resource type and the specific resource where the control action will take place. Then select the control action to be executed using the last drop-down and click OK. *
Neither a notification nor a control action is required for an alert definition. The action part of the alert definition is meant to be as flexible as possible, so any combination of notifications and control action is allowed.
* Available through Hyperic HQ Enterprise subscription