SQL Alerts

Note: While SQL-based alerting remains available for customers with existing SQL-based alerts, users are encouraged to create all new alerts with Policy-based alerting.

The following topics provide a general overview of the SQL-based alerting system in Kentik Detect:

Notes:
- For information on how the portal is used to configure, edit, and control SQL-based alerts, see SQL Alert Settings.
- For information on Kentik Detect’s policy-based alerting system, see Policy Alert Overview.

 

About SQL Alerts

The Kentik Detect alerting system provides for alerts that result in notification of subscribed users when the network data ingested into Kentik Detect meets a set of user-defined conditions. SQL Alerts are configured by administrators on a customer-wide basis. They may be created from scratch (e.g. on the Add SQL Alert page of the portal) or customized (on the Edit alert page) from SQL Alert Presets (templates) provided by Kentik (found in the SQL Alerts List table).

The alerting system uses a two-tiered threshold to determine when to enter an alarm state, thereby triggering notifications, and it uses two user-configurable criteria to determine when to reset (clear the alarm state) as conditions change. Details on how the alerting system works are covered in the following topics.

 

SQL Alert Terminology

The Kentik Detect alerting system uses the following terminology (further details of which are explained in later sections):

  • SQL Alert: The umbrella term for a single SQL alert definition made in the portal or via the alerting API. Each alert encompasses a query, which includes a definition of the alert’s key, and additional configurable parameters such as the criticality, the event window, the trigger threshold, and the reset period.
  • SQL Alert table: The table that is used by the alerting system to keep track of matches and events for a given alert.
  • Key: The value of a column in the Main Table (see KDE Tables). Each key is an instance of the class of “thing” (e.g. ASN, IP address, or interface) about which the alert monitors data (e.g. traffic volume).
  • Output values: Any two Main Table columns that may optionally be designated (e.g. $_out_in_bytes, $_out2_ in_pkts) in the alert query. The values of those fields will be included in alert notifications.
  • SQL poll frequency: The time interval at which an alert’s query is run, every 30, 60 (default), or 120 seconds. This is a global setting (see Global SQL Alert Settings).
  • Lookback window: The timespan (back in time from “now”) over which flow is averaged each time an alert query is run. May be 30, 60 (default), or 120 seconds. Typically set to match the poll frequency.
  • Match: An individual occurrence in which the flow data for a given key, averaged over the lookback window, matches the conditions specified in the alert. For each key, one match returned from the alert query is timestamped with time of the query run and added to the alert table.
    Example: The alert queries for any IPs (key = ipv4_dst_addr) that are the destination of greater than 3000 packets per second of total traffic. Each time the query is run, each IP that matches the specified condition is returned as a row and constitutes one match.
    Note: When more than one returned row has the same key, the match to add to the alert table is chosen as follows:
    - if the query includes both an out and an out2, the row with the highest sum of those fields;
    - otherwise, the first row returned.
  • Event window (specified in SQL Alert Add/SQL Alert Edit): The duration (back in time from “now”) over which the alerting system keeps a match count.
  • Match count: The count, updated at the SQL polling frequency, of times within the current event window that a match for a given key has been added to the alert table. A separate count is maintained for each key that has matches in the table.
    Example: If the event window is set to two hours then the match count for a given key includes every match for that key that was added to the alert table between “now minus two hours” and now. With each new match for that key in the table the match count is incremented. When the timestamp of a given match falls out of the event window (e.g. becomes more than two hours old), the match count for the corresponding key is decremented.
  • Trigger threshold (specified in SQL Alert Add/SQL Alert Edit): The match count required to trigger an event and enter alarm state. Since only one match is added to the table for each key at each poll, and the match count is maintained on a per-key basis, the threshold also determines the minimum number of polls required to trigger alarm state.
  • SQL Alert state: The current state of an individual alert (see SQL Alert States), either alarm (ALARM), acknowledgement required (ACK_REQ), or clear (CLEAR).
  • Event: A timespan defined by a set of matches for a given key value in the alert table that have resulted in an alert entering alarm state. The start of an event is the timestamp of the earliest match for a given key in the alert table (not the time alarm state is entered). The end of an event is the timestamp of the key’s last match before the alert leaves alarm state (which corresponds to the start of the Reset Period).
  • Notification: The alerting system’s response to a change of alert state (e.g. from clear to alarm). A notification is either an email sent to the customer’s list of alert subscribers, a syslog entry, and/or an HTTP POST to a URL (see SQL Alert Notifications).
  • Supplemental query: A query that can be run at each poll, starting with the time an alert enters alarm state. A supplemental query provides additional information about the state of the network. The supplemental query results are included in notifications (fully in URL, truncated in email and syslog). Two supplemental queries may be specified for each alert.
    Example: Supply a mitigation service with the source IP’s and interfaces of an internal IP that is being attacked on a continuous basis.
  • Reset Period: The duration for which there must be no matches in order for an alert to leave alarm state.

In a typical alert scenario, the above terminology would be applied as follows:

  • The alert was configured in the portal with a trigger threshold of 5 matches, an event window of 2 hours, and a reset period of 15 minutes.
  • The first match for the alert occurred at 3:54:12, and was added to the alert table.
  • The fifth match was added to the alert table at 5:50:37, at which point the match count reached the trigger threshold. That triggered the alarm state, resulting in a status-change notification and also an event with a start time of 3:54:12 (time of first match).
  • Several additional matches were added to the alert table between 5:50:37 and 7:05:38.
  • There were no further matches after 7:05:38, so at 7:20:38 the reset period was fully elapsed. The alert left alarm state, resulting in a status-change notification. The end-time of the event was the time of the last match (7:05:38).
 

SQL Alert States

Once configured and enabled (see Add or Edit SQL Alerts), an alert can be in one of three states:

  • Alarm (ALARM): the match count has reached the trigger threshold, and the reset period has not yet elapsed.
  • Acknowledgement required (ACK_REQ): the match count reached the trigger threshold and the reset period subsequently elapsed, but the alert must be acknowledged manually in the SQL Alerts Incidents page to be fully cleared.
    Note: Applies only to alerts whose criticality has been specified as Major or Critical.
  • Clear (CLEAR): the alert has either never entered alarm state or was in alarm state and then cleared, either automatically (Info or Minor alerts) or manually (Major or Critical alerts). While in clear state, alerts do not generate any status reminder notifications.
    Note: An alert whose criticality has been specified as Info or Minor and that has been in alarm state will revert automatically to clear state, without passing through Acknowledgement Required state, when the reset period has expired without any further matches.
 

SQL Alert Query

The core of every alert is the alert query. The query is structured to allow the user to define a set of conditions that the alerting system will watch for as flow is ingested into the Kentik Data Engine (KDE) datastore. The query is run at the SQL poll frequency (see Global SQL Alert Settings) and queries a KDE main table, either for a given device or for all devices (see KDE Tables).

An alert query includes one key and (optionally) up to two output values:

  • Key: the value of the column corresponding to the “what” that the alert is set to monitor. For example, we may want to look at IP addresses to track the traffic to each one and perhaps even activate a mitigation device to scrub or protect addresses when certain conditions are met. In this case our key would be the value of the main table column ipv4_dst_addr; in other situations it might be an AS (src/dst_as), port (input/output_port), etc. The key is designated by appending “$_key_” to the column name in the SELECT clause of the alert query. Each query can have only one key.
  • Output values: two columns containing values that we want returned about the key. These are typically the values that result in an alert match when they meet the conditions defined in the alert query. For example, if the key is IP address, the out1 value might be the number of packets (as pps) and the out2 value might be the number of bytes (as bps). To designate the column for the out1 or out2 value, insert “$_out_” or “$_out2_” before the column name in the SELECT clause of the alert query.

If the query finds at least one key value for which current conditions match the query-defined conditions then the query will return rows, one of those rows will be added to the alert table, and the match count will be incremented (see Triggering an Alarm).

Note: Every query must contain the WHERE clause ctimestamp > %lookbackseconds% (or some variant such as >=). %lookbackseconds% is a variable representing the value specified in the Lookback Window field on the Add SQL Alert or Edit SQL Alert page in the portal (see SQL Alert Field Definitions).

Example Query

The following example query assumes that you have set up a MYNETWORK tag (see MYNETWORK Tag Queries). The query checks for IPs that are outside of your network (as indicated by a MYNETWORK tag) and are sending more than 75K pps or 75 Mbps from outside of your network to one or more specified ports:

SELECT This section specifies the columns to look in for the requested data.
  now(), Time-stamp for the alert.
  i_device_id, The column (i_device_id) containing the id of the router/device where the traffic was observed.
  ipv4_dst_addr AS ipv4_dst_addr$_key_, The column that has been designated as the source of the key for this alert by appending “$_key_” to the column name (ipv4_dst_addr). In this case the value of the key will be the IP of the device receiving the traffic.
  round(sum(both_pkts)/%lookbackseconds%) AS f_sum_both_pkts$_out_pps, The column (pps) on which to run a subquery function (see Subquery Function Syntax) that returns the packets/second as pps (the sum of packets for the time period divided by the seconds in the time period). This column has been designated as an output value for this alert by inserting “$_out_” between the function name and the column name.
  round((sum(both_bytes)/%lookbackseconds%/1000000)*8) AS f_sum_both_bytes$_out2_mbps The column (bps) on which to run a subquery function that returns the bytes/second as mbps (the sum of bytes for the time period, divided by seconds in time period, divided by 1M for Mbytes, multiplied by 8 bits for mbits). This column has been designated as an output value for this alert by inserting “$_out2_” between the function name and the column name.
FROM all_devices The table/device to look at for the requested data, in this case all_devices (see All Devices Table).
WHERE This section specifies the conditions that must be true for a row to be returned from the query.
  i_device_type != 'host' The device is a router (not a host).
  AND src_flow_tags NOT LIKE '%MYNETWORK%' The source of the traffic is outside of your network (as defined by MYNETWORK tag).
  AND ctimestamp > %lookbackseconds% The traffic occurred in the last N seconds where N is defined by the variable %lookbackseconds%, which represents the Lookback Window (see SQL Alert Field Definitions).
  AND protocol=17 The traffic is UDP traffic.
  AND (l4_dst_port = 0 OR l4_dst_port = 19 OR l4_dst_port=80 OR l4_dst_port=123 OR l4_dst_port=161 OR l4_dst_port=17 OR l4_dst_port=1900 OR l4_src_port = 0 OR l4_src_port = 19 OR l4_src_port=80 OR l4_src_port=123 OR l4_src_port=161 OR l4_src_port=17 OR l4_src_port=1900) The traffic is targeted at services commonly used for DDoS reflection/amplification floods.
GROUP by i_device_id,ipv4_dst_addr Specifies that only one row is to be returned per unique combination of i_device_id and ipv4_dst_addr.
HAVING This section specifies which rows to return:
  ((sum(both_pkts)/%lookbackseconds%) > 75000) Average traffic during the lookback window is greater than 75K packets/sec,
  OR ((sum(both_bytes)/%lookbackseconds%/1000000)*8) > 75 or average traffic during the lookback window is greater than 75 Mbps.
ORDER by f_sum_both_bytes$_out2_mbps DESC Specifies that the returned rows be sorted by Mbps in descending order.
LIMIT 100 Specifies that only the first 100 rows are returned.

The alerting system would run the above query at the global SQL poll frequency, e.g. every 60 seconds, evaluating the most recent minute (set by the lookback window) of flow data. Whenever the traffic over that minute from outside of your network to the specified ports of any IP averages over 75K pps or 75 Mbps, the query would return a match. For each key (IP), one timestamped match will be added to the alerting table associated with that alert.

The following tips can help you avoid common pitfalls when constructing an alert query:

  • Rather than specifying the lookback window directly in the query you must use the %lookbackseconds% variable.
    - Incorrect: WHERE ctimestamp > 60
    - Correct: WHERE ctimestamp > %lookbackseconds%
  • When querying all devices, it is recommended in most situations to break on device_id to avoid double-counting of traffic:
    - Typically not ideal: GROUP by ipv4_dst_addr
    - Recommended: GROUP by i_device_id, ipv4_dst_addr
  • Instead of selecting in_pkts/bytes or out_pkts/bytes, you would typically use both_pkts/bytes, which will cover flow collected at both ingress and egress on interfaces, in case any routers are not set to collect on ingress:
    - Typically not ideal: GROUP by ipv4_dst_addr
    - Recommended: GROUP by i_device_id, ipv4_dst_addr
  • Use an ORDER BY clause (DESC or ASC depending on what you are looking for) to ensure that the results include the values that you are looking for.

Notes:
- The KDE interface will check that every query is properly time-bounded with a clause similar to “ctimestamp > %lookbackseconds%.” This ensures that the query looks back for the timespan set in the alert’s Lookback Window field (see SQL Alert Field Definitions).
- For a look at the columns in a KDE main table, see KDE Tables.
- For a general explanation of how queries are used in Kentik Detect see Querying KDE.
- For a general introduction to PostgreSQL see the PostgreSQL Tutorial (http://www.postgresqltutorial.com).

 

Triggering an Alarm

The alerting system uses a two-stage process for determining when to trigger an alarm. As described in SQL Alert Query, the first stage involves running an alert query at regular intervals to monitor for matches. If the query returns data, a timestamped match is stored in a table in the alert system database, and the alert’s match count is incremented.

The second stage of alerting involves the actual entering of alarm state. When the match count reaches the trigger threshold, the alert enters alarm state, which results in the issuing of a status change notification. The entering of alarm state causes the system to define an event, the start time of which is the time of the earliest match that is still within the event window.

To see how this works in practice, let’s say that you’ve set a polling frequency of 60 seconds, an event window of two hours, and a trigger threshold of five matches. Every 60 seconds the alert system will update the count of matches in the alert table whose timestamp is within the previous two hours. If the match count reaches five, the alert goes into alarm, a notification is issued, and an event begins, with a start time corresponding to the timestamp of the earliest match in the event window.

 

Clearing an Alarm

An alert system that is in alarm becomes reset-ready when either of the following becomes true:

  • The number of matches within the event window falls below the trigger threshold.
  • The time elapsed since the most recent match exceeds the reset period.

The way the system handles a reset-ready alert depends on the alert’s criticality (see SQL Alert Field Definitions):

  • Info/Minor alert: the state of an alert that reaches the end of its reset period without an additional match is automatically reclassified to clear and the alert is also automatically removed from the pending alerts table on the SQL Alerts Incidents page (see Pending SQL Alerts Table).
  • Major/Critical alert: the state of an alert that reaches the end of its reset period without an additional match is automatically reclassified to acknowledgement required, but the alert remains in the pending alerts table until it is manually cleared with the Clear button in the corresponding row of the table.

The clearing of an alert’s alarm state does not clear the system of stored matches for that alert or reset the match count. In the scenario described above, for example, if the reset period is set to 15 then the alarm will clear when there’s been no match in the previous 15 minutes. But if at the time of a subsequent match the match count is at the trigger threshold or greater, a new alarm will be triggered.

 

SQL Alert Notifications

The alerting system uses multiple modes of notification:

  • SQL Alerts Incidents page: Each alert that is in either ALARM state or acknowledge-required (ACK_REQ) state is represented as a single row in the table of pending alerts (see SQL Alerts Incidents). When an alert’s state changes to clear, the alert is removed from the dashboard, but it can still be viewed in the SQL Alerts Log.
  • URL: Notification will be sent as JSON in the message body of an HTTP POST to the specified URL:
    - Notice is posted only when the alert enters ALARM state or leaves ALARM state for either ACK_REQ state or CLEAR state (see SQL Alert States).
    - Notifications are posted within 10 seconds of the state change.
    - No notification is sent when an alarm in ACK_REQ state is set to CLEAR state.
  • Syslog: Notification will be sent as JSON to a syslog receiver IP address:
    - Notice is posted only when the alert enters ALARM state or leaves ALARM state for either ACK_REQ state or CLEAR state.
    - Notifications are posted within 10 seconds of the state change.
    - No notification is sent when an alarm in ACK_REQ state is set to CLEAR state.
  • Email: Notification will be sent to a user-specified list (comma-delimited) of addresses (see Global SQL Alert Settings). Email alerts generate two types of notifications:
    - SQL Alert status change: Issued two minutes after an alert changes state (e.g. goes from CLEAR to ALARM, or vice versa). The two-minute delay allows an additional alert query to be run before sending, thereby providing a fuller picture of the conditions triggering the alert.
    - SQL Alert reminder: issued at a user-defined interval from the time an alert enters ALARM state until it is cleared. The interval is set with the Send Updates Every field in the portal’s Global SQL Alert Settings (see Global SQL Alert Settings).
    Notes:
    - If an alert is in ALARM state for less than two minutes, an email notification may be sent about the alert leaving ALARM state without an email having been sent about the alert entering ALARM state.
    - Notification emails are sent from the email address no-reply@kd.kentik.io.

SQL Alert notifications include general information such as the date/time and a count of the alerts currently in alarm state. They also include the following details about each alert that is not in CLEAR state:

Current state Current state of alert: ALARM or ACK_REQ
Previous state Previous state of alert
Alert Name Name specified by user when alert was created
Alert ID ID assigned to the alert by Kentik when it was created
Event ID ID assigned to the event by Kentik when the alert goes into alarm state
Severity Alert severity as set on Add/Edit Alert page in portal
Description Description text from Add/Edit Alert page in portal
Key name Name of the field designated in the alert query as the key
Key value Value returned from query for the key field
Out1 name Name of the field designated in the alert query as the source of the out value
Out1 value Value returned from query for the out field
Out2 name Name of the field designated in the alert query as the source of the out2 value
Out2 value Value returned from query for the out2 field
# Queries over thresh Count of matches so far in the event
Time: First over-thresh Time of earliest match in the event window
Time: Last over-thresh Time of most recent match in the event window
SQL The alert query
Query result The data returned from the most recent run of the alert query
Supplemental query 1 result The data returned from the most recent run of supplemental query 1
Supplemental query 2 result The data returned from the most recent run of supplemental query 2

 

SQL Alert History

The alerting system uses internal history tables to store information related to matches and events, and makes that information available in the SQL Alerts Log. This historical information is retained as follows:

  • Event history: 90 days.
  • Match history: 30 days.
© 2014- Kentik
In this article:
×