Monitoring Rules Framework
Standard Rules Framework
Establishing a standard template for describing monitoring types should make it easier to communicate and define how networks, devices, and services are monitored. Zabbix-specific issues and syntax should not appear in these definitions. Instead, monitoring types need to be defined in a way that is as clear and easy for humans to understand as possible. This includes descriptions as well as algorithms.
By expanding the monitoring logic beyond the sole realm of the Zabbix-savvy administrator, it is possible to include many more people in the design process. In particular, engineers responsible for infrastructure components can take an active part in defining router, switch, firewall, and other infrastructure monitoring logic. Engineers working with application-layer services (DNS, Web, Email, etc.) can also actively participate, bringing their expertise to bear on problems specific to these applications.
Once definitions are established and agreed upon, the Zabbix engineer can convert them to Zabbix-specific Items, Triggers, and Actions.
Template Definition
The standard template includes the name, description, base testing algorithm, test frequency, data type returned, and trigger and action details for both entering and leaving a problem state. The information should be stated as clearly as possible, but with enough detail that it can be encoded properly into the Zabbix framework. The template structure should look like:
Name | Description |
---|---|
Description | Verbose description of the resource to test |
Test Algorithm | Verbose description of the test algorithm |
Test Return Value Type | One of Boolean, Integer, Float, or String |
Frequency | How often to run the test |
PROBLEM STATE ENTRY | |
Trigger Algorithm | Description of the logic to enter a PROBLEM state |
Action Definition | Description of what to do when we enter a PROBLEM state |
PROBLEM STATE EXIT | |
Trigger Algorithm | Description of the logic to exit a PROBLEM state |
Action Definition | Description of what to do when we exit a PROBLEM state |
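For readers who want to keep these definitions in a machine-readable form alongside the tables, the template could be mirrored as a small data structure. The following is only a sketch: the class and field names (`MonitorDefinition`, `StateTransition`, `frequency_seconds`, and so on) are hypothetical and not part of Zabbix or of this framework, and representing a graph-only monitor by an empty trigger is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ReturnType(Enum):
    """The four return value types allowed by the template."""
    BOOLEAN = "Boolean"
    INTEGER = "Integer"
    FLOAT = "Float"
    STRING = "String"


@dataclass(frozen=True)
class StateTransition:
    """Trigger and action for entering or leaving a PROBLEM state."""
    # Assumption: None stands for "NONE: For graphical monitoring use only".
    trigger_algorithm: Optional[str] = None
    action_definition: Optional[str] = None


@dataclass(frozen=True)
class MonitorDefinition:
    """One filled-in copy of the template table (hypothetical structure)."""
    name: str
    description: str
    test_algorithm: str
    return_type: ReturnType
    frequency_seconds: int
    problem_entry: StateTransition = field(default_factory=StateTransition)
    problem_exit: StateTransition = field(default_factory=StateTransition)

    def is_graph_only(self) -> bool:
        """True when neither transition defines a trigger."""
        return (self.problem_entry.trigger_algorithm is None
                and self.problem_exit.trigger_algorithm is None)


# Example: a definition with triggers and one without.
with_triggers = MonitorDefinition(
    name="Example service check",
    description="Verbose description of the resource to test",
    test_algorithm="Verbose description of the test algorithm",
    return_type=ReturnType.BOOLEAN,
    frequency_seconds=60,
    problem_entry=StateTransition("Logic to enter a PROBLEM state",
                                  "What to do when we enter a PROBLEM state"),
    problem_exit=StateTransition("Logic to exit a PROBLEM state",
                                 "What to do when we exit a PROBLEM state"),
)
graph_only = MonitorDefinition(
    name="Example graph-only check",
    description="Verbose description of the resource to test",
    test_algorithm="Verbose description of the test algorithm",
    return_type=ReturnType.BOOLEAN,
    frequency_seconds=60,
)
```

Keeping the algorithm fields as free text preserves the goal of human-readable definitions; only the return type and frequency are constrained.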
The rest of this document includes a few monitoring definitions for remotely monitored services. These should be considered only starting points, as many additional resources need both remote and on-network monitoring. The entries below detail only a few remote monitoring possibilities. Aside from the DNS monitoring and 3-Phase Web Monitor, none of them involve triggers or actions; they are used only for online graph analysis. In an operational network one would want not only to take note of service availability trends but also to be notified should critical problems or service outages be detected.
Remote DNS Monitoring
Remote DNS monitoring sends out a DNS request to a specific nameserver and checks for a response within a pre-defined period of time. The DNS query may differ depending on what type of nameserver we are monitoring.
Name | Description |
---|---|
Description | Test remote DNS server by sending DNS request. |
Test Algorithm | Send either an NS or SOA record request to the remote server. 1-second timeout, up to 2 attempts per test run. |
Test Return Value Type | Boolean |
Frequency | Every 20 seconds |
PROBLEM STATE ENTRY | |
Trigger Algorithm | >80% failure rate over a 10-minute period |
Action Definition | Send notification message to CRITICAL list with traceroute details |
PROBLEM STATE EXIT | |
Trigger Algorithm | <10% failure rate over a 10-minute period |
Action Definition | Send notification message to CRITICAL list with traceroute details |
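The entry and exit thresholds above form a hysteresis band: the monitor enters PROBLEM above an 80% failure rate and leaves it only below 10%. The sketch below shows one way to express that logic outside of Zabbix. It assumes a count-based window (a 10-minute period at one test every 20 seconds is 30 samples) and assumes that no state change happens until the window is full; the class name `FailureRateTrigger` and its parameters are hypothetical.

```python
from collections import deque


class FailureRateTrigger:
    """Hysteresis trigger over a sliding window of Boolean test results."""

    def __init__(self, window_seconds=600, interval_seconds=20,
                 enter_above=0.80, exit_below=0.10):
        # Assumption: tests arrive at a regular interval, so the time window
        # can be approximated by a fixed number of samples (600 / 20 = 30).
        self.samples = deque(maxlen=window_seconds // interval_seconds)
        self.enter_above = enter_above
        self.exit_below = exit_below
        self.in_problem = False

    def failure_rate(self):
        """Fraction of failed tests in the current window (0.0 if empty)."""
        if not self.samples:
            return 0.0
        return self.samples.count(False) / len(self.samples)

    def update(self, test_passed):
        """Record one test result and return the current PROBLEM state."""
        self.samples.append(bool(test_passed))
        # Assumption: do not evaluate until a full window has been collected.
        if len(self.samples) == self.samples.maxlen:
            rate = self.failure_rate()
            if not self.in_problem and rate > self.enter_above:
                self.in_problem = True
            elif self.in_problem and rate < self.exit_below:
                self.in_problem = False
        return self.in_problem
```

With these defaults, entering PROBLEM takes at least 25 failures in the last 30 tests, and leaving it takes no more than 2. A notification would be sent only when the returned state differs from the previous one.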
Remote Web Monitoring
Remote web monitoring involves the testing system emulating a live web browser and trying to download one or more pages (using HTTP/HTTPS) from a remote server. For definitions that require more than one page test, be sure to include the details for each test and the criteria for determining whether the monitoring event should succeed.
Name | Description |
---|---|
Description | Test availability of remote web server |
Test Algorithm | Attempt to download 3 pages from the remote server. 200 Status return on success for each page, 1 retry attempt, 15-second timeout |
Test Return Value Type | Boolean |
Frequency | Every 30 seconds |
PROBLEM STATE ENTRY | |
Trigger Algorithm | Two successive test failures |
Action Definition | Send notification message to CRITICAL list with traceroute details |
PROBLEM STATE EXIT | |
Trigger Algorithm | Successful test run |
Action Definition | Send notification message to CRITICAL list with traceroute details |
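The multi-page test and its trigger can be sketched as follows. This is an illustration under stated assumptions rather than the way Zabbix implements web scenarios: the function and class names are hypothetical, the page fetcher is passed in so that it can be replaced in tests, and the default fetcher uses Python's `urllib`, which follows redirects and therefore reports the final status code.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=15):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (OSError, http.client.HTTPException):
        return None      # timeout, DNS failure, refused connection, etc.


def run_web_test(urls, fetch=fetch_status, retries=1):
    """True only if every page returns 200, with `retries` extra tries each."""
    for url in urls:
        if not any(fetch(url) == 200 for _ in range(retries + 1)):
            return False
    return True


class SuccessiveFailureTrigger:
    """Enter PROBLEM after N failures in a row; exit on the first success."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.in_problem = False

    def update(self, test_passed):
        """Record one test run and return the current PROBLEM state."""
        if test_passed:
            self.consecutive_failures = 0
            self.in_problem = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.in_problem = True
        return self.in_problem
```

A scheduler would call `run_web_test` with the three page URLs every 30 seconds and feed the result to `SuccessiveFailureTrigger.update`, sending a notification whenever the returned state changes.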
Name | Description |
---|---|
Description | Test availability of remote web server – simple test |
Test Algorithm | Attempt to download 1 page from the remote server. 200 Status return on success for each page, 1 retry attempt, 15-second timeout |
Test Return Value Type | Boolean |
Frequency | Every 60 seconds |
PROBLEM STATE ENTRY | |
Trigger Algorithm | NONE: For graphical monitoring use only |
Action Definition | |
PROBLEM STATE EXIT | |
Trigger Algorithm | NONE: For graphical monitoring use only |
Action Definition | |
Remote SMTP Monitoring
Remote SMTP (Simple Mail Transfer Protocol) monitoring tests to see if a remote mail server is visible and responding.
Name | Description |
---|---|
Description | Check if remote SMTP server is running and accepting TCP connections |
Test Algorithm | Connect via TCP to remote port 25 |
Test Return Value Type | Boolean |
Frequency | Every 60 seconds |
PROBLEM STATE ENTRY | |
Trigger Algorithm | NONE: For graphical monitoring use only |
Action Definition | |
PROBLEM STATE EXIT | |
Trigger Algorithm | NONE: For graphical monitoring use only |
Action Definition | |
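A TCP connection test of this kind reduces to a single connect attempt that yields a Boolean. The sketch below uses Python's standard `socket` module; the function name `tcp_port_open` and the 5-second timeout are assumptions, since the definition above does not specify a timeout. The same function applies to any other TCP port.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection resolves the host and completes the TCP handshake;
        # the connection is closed again as soon as the block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the SMTP definition this would be called as `tcp_port_open(mail_host, 25)` once every 60 seconds, where `mail_host` is a placeholder for the server under test. Note that a successful connect shows only that the port accepts connections, not that the server speaks SMTP correctly.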
Remote IMAP Monitoring
Remote IMAP (Internet Message Access Protocol) monitoring tests to see if a remote email message store is visible and responding.
Name | Description |
---|---|
Description | Check if remote IMAP server is running and accepting TCP connections |
Test Algorithm | Connect via TCP to remote port 143 |
Test Return Value Type | Boolean |
Frequency | Every 60 seconds |
PROBLEM STATE ENTRY | |
Trigger Algorithm | NONE: For graphical monitoring use only |
Action Definition | |
PROBLEM STATE EXIT | |
Trigger Algorithm | NONE: For graphical monitoring use only |
Action Definition | |