Monitoring Rules Framework

Standard Rules Framework

Establishing a standard template for describing monitoring types should make it easier to communicate and define how networks, devices, and services are monitored. Zabbix-specific issues and syntax should not appear in these definitions. Instead, monitoring types need to be defined in a way that is as clear and easy for humans to understand as possible. This applies to descriptions as well as algorithms.

By expanding the monitoring logic beyond the sole realm of the Zabbix-savvy administrator, it is possible to include many more people in the design process. In particular, engineers responsible for infrastructure components can take an active part in defining router, switch, firewall, and other infrastructure monitoring logic. Engineers working with application-layer services (DNS, Web, Email, etc.) can also actively participate, bringing their expertise to bear on problems specific to these applications.

Once definitions are established and agreed upon, the Zabbix engineer can convert them to Zabbix-specific Items, Triggers, and Actions.

Template Definition

The standard template includes the name, description, base testing algorithm, data type returned, test frequency, and trigger and action details for both entering and leaving a problem state. The information should be stated as clearly as possible, but with enough detail that it can be encoded properly into the Zabbix framework. The template structure should look like this:

Standard Template Format
Name	Description
Description	Verbose description of the resource to test
Test Algorithm	Verbose description of the test algorithm
Test Return Value Type	One of Boolean, Integer, Float, or String
Frequency	How often to run the test
PROBLEM STATE ENTRY
Trigger Algorithm	Description of the logic to enter a PROBLEM state
Action Definition	Description of what to do when we enter a PROBLEM state
PROBLEM STATE EXIT
Trigger Algorithm	Description of the logic to exit a PROBLEM state
Action Definition	Description of what to do when we exit a PROBLEM state
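As a rough illustration of how the template fields above might be captured in a machine-readable form before conversion to Zabbix, the following Python sketch models one definition as a small data structure. The class and field names (`MonitoringDefinition`, `StateRule`, `frequency_seconds`, and so on) are hypothetical and chosen only for this example; they are not part of Zabbix or of the framework itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ReturnType(Enum):
    """The four return value types allowed by the template."""
    BOOLEAN = "Boolean"
    INTEGER = "Integer"
    FLOAT = "Float"
    STRING = "String"


@dataclass(frozen=True)
class StateRule:
    """Trigger and action text for entering or leaving a PROBLEM state.

    Assumption: None stands for "NONE: For graphical monitoring use only".
    """
    trigger_algorithm: Optional[str] = None
    action_definition: Optional[str] = None


@dataclass(frozen=True)
class MonitoringDefinition:
    """One filled-in copy of the standard template (hypothetical model)."""
    name: str
    description: str
    test_algorithm: str
    return_type: ReturnType
    frequency_seconds: int
    problem_entry: StateRule = field(default_factory=StateRule)
    problem_exit: StateRule = field(default_factory=StateRule)

    def is_graph_only(self) -> bool:
        """True when neither state transition defines a trigger."""
        return (self.problem_entry.trigger_algorithm is None
                and self.problem_exit.trigger_algorithm is None)


# Usage: a graph-only definition with no triggers or actions.
smtp = MonitoringDefinition(
    name="Remote SMTP Monitoring",
    description="Check if remote SMTP server is running and accepting TCP connections",
    test_algorithm="Connect via TCP to remote port 25",
    return_type=ReturnType.BOOLEAN,
    frequency_seconds=60,
)
```

Keeping the trigger and action fields as free text preserves the human-readable intent of the template; only the return type and frequency are constrained here.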

The rest of this document includes a few monitoring definitions for remotely monitored services. These should be considered only starting points, as many additional resources need both remote and on-network monitoring. The entries below detail only a few remote monitoring possibilities. Aside from the DNS monitoring and 3-Phase Web Monitor, none of them involve triggers or actions; they are used only for online graph analysis. In an operational network one would want not only to take note of service availability trends but also to be notified should critical problems or service outages be detected.

Remote DNS Monitoring

Remote DNS monitoring sends out a DNS request to a specific nameserver and checks for a response within a pre-defined period of time. The DNS query may differ depending on what type of nameserver we are monitoring.

Example: Remote DNS Monitoring
Name	Description
Description	Test remote DNS server by sending DNS request.
Test Algorithm	Send either an NS or an SOA record request to the remote server. 1-second timeout, up to 2 attempts per test run.
Test Return Value Type	Boolean
Frequency	Every 20 seconds
PROBLEM STATE ENTRY
Trigger Algorithm	>80% failure rate over a 10-minute period
Action Definition	Send notification message to CRITICAL list with traceroute details
PROBLEM STATE EXIT
Trigger Algorithm	<10% failure rate over a 10-minute period
Action Definition	Send notification message to CRITICAL list with traceroute details
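The entry and exit thresholds in the DNS example differ (above 80% to enter, below 10% to exit), which keeps the state from flapping when the failure rate hovers near a single cutoff. The sketch below shows one way this logic could be expressed in Python. It assumes the 10-minute period is approximated by the last 30 results (one test every 20 seconds) and that no decision is made until the window is full; the class name `FailureRateTrigger` is hypothetical, and this is not how Zabbix itself evaluates triggers.

```python
from collections import deque


class FailureRateTrigger:
    """Two-threshold trigger over a sliding window of Boolean test results."""

    def __init__(self, window_size=30, enter_above=0.80, exit_below=0.10):
        # 30 samples = 10 minutes at one test every 20 seconds (assumption).
        self.results = deque(maxlen=window_size)
        self.enter_above = enter_above
        self.exit_below = exit_below
        self.in_problem = False

    def failure_rate(self):
        """Fraction of failed tests in the current window (0.0 if empty)."""
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def record(self, success):
        """Record one result; return "PROBLEM", "OK", or None if unchanged."""
        self.results.append(bool(success))
        # Assumption: wait for a full window before evaluating anything.
        if len(self.results) < self.results.maxlen:
            return None
        rate = self.failure_rate()
        if not self.in_problem and rate > self.enter_above:
            self.in_problem = True
            return "PROBLEM"   # would send the CRITICAL notification
        if self.in_problem and rate < self.exit_below:
            self.in_problem = False
            return "OK"        # would send the recovery notification
        return None
```

With these defaults, at least 25 of the last 30 tests must fail to enter the PROBLEM state, and no more than 2 of the last 30 may fail before it is left again.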

Remote Web Monitoring

Remote web monitoring involves the testing system emulating a live web browser and trying to download one or more pages (using HTTP/HTTPS) from a remote server. For definitions that require more than one page test, be sure to include the details for each test and the criteria for determining whether the monitoring event as a whole should succeed.

Example: Remote Web Monitoring
Name	Description
Description	Test availability of remote web server
Test Algorithm	Attempt to download 3 pages from the remote server. 200 Status return on success for each page, 1 retry attempt, 15-second timeout
Test Return Value Type	Boolean
Frequency	Every 30 seconds
PROBLEM STATE ENTRY
Trigger Algorithm	two successive test failures
Action Definition	Send notification message to CRITICAL list with traceroute details
PROBLEM STATE EXIT
Trigger Algorithm	successful test run
Action Definition	Send notification message to CRITICAL list with traceroute details
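The multi-page test and the "two successive failures" trigger above could be sketched in Python roughly as follows. This is only an illustration using the standard library: it assumes the single retry applies per page rather than per test run, it does not emulate a full browser, and the names `fetch_status`, `run_web_test`, and `SuccessiveFailureTrigger` are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=15.0):
    """Return the HTTP status code, or None on timeout or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None              # no usable answer at all


def run_web_test(urls, fetch=fetch_status, retries=1):
    """True only if every page returns 200, allowing `retries` extra tries per page."""
    for url in urls:
        # any() stops at the first 200, so a retry happens only after a failure.
        if not any(fetch(url) == 200 for _ in range(retries + 1)):
            return False
    return True


class SuccessiveFailureTrigger:
    """Enter PROBLEM after `threshold` failures in a row; exit on one success."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.in_problem = False

    def record(self, success):
        """Record one test run; return "PROBLEM", "OK", or None if unchanged."""
        if success:
            self.consecutive_failures = 0
            if self.in_problem:
                self.in_problem = False
                return "OK"
            return None
        self.consecutive_failures += 1
        if not self.in_problem and self.consecutive_failures >= self.threshold:
            self.in_problem = True
            return "PROBLEM"
        return None
```

Passing the fetch function as a parameter makes it possible to exercise the pass/fail criteria with a stub instead of a live server.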

Example: Remote Web Monitoring (Phase 1)
Name	Description
Description	Test availability of remote web server – simple test
Test Algorithm	Attempt to download 1 page from the remote server. 200 Status return on success for each page, 1 retry attempt, 15-second timeout
Test Return Value Type	Boolean
Frequency	Every 60 seconds
PROBLEM STATE ENTRY
Trigger Algorithm	NONE: For graphical monitoring use only
Action Definition
PROBLEM STATE EXIT
Trigger Algorithm	NONE: For graphical monitoring use only
Action Definition

Remote SMTP Monitoring

Remote SMTP (Simple Mail Transfer Protocol) monitoring tests to see if a remote mail server is visible and responding.

Example: Remote SMTP Monitoring
Name	Description
Description	Check if remote SMTP server is running and accepting TCP connections
Test Algorithm	Connect via TCP to remote port 25
Test Return Value Type	Boolean
Frequency	Every 60 seconds
PROBLEM STATE ENTRY
Trigger Algorithm	NONE: For graphical monitoring use only
Action Definition
PROBLEM STATE EXIT
Trigger Algorithm	NONE: For graphical monitoring use only
Action Definition
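A TCP connect test of this kind can be sketched in a few lines of Python using only the standard library. The function name `tcp_port_open` and the 5-second timeout are assumptions made for this example; the definition above does not specify a timeout.

```python
import socket


def tcp_port_open(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

The Boolean result matches the Test Return Value Type in the definition. Note that a successful connect shows only that something is listening on the port, not that the mail service behind it is working correctly. The same function could serve any other check that is defined purely as a TCP connect by passing a different `port`.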

Remote IMAP Monitoring

Remote IMAP (Internet Message Access Protocol) monitoring tests to see if a remote email message store is visible and responding.

Example: Remote IMAP Monitoring
Name	Description
Description	Check if remote IMAP server is running and accepting TCP connections
Test Algorithm	Connect via TCP to remote port 143
Test Return Value Type	Boolean
Frequency	Every 60 seconds
PROBLEM STATE ENTRY
Trigger Algorithm	NONE: For graphical monitoring use only
Action Definition
PROBLEM STATE EXIT
Trigger Algorithm	NONE: For graphical monitoring use only
Action Definition
