CH logo® Knowledge Base
Contents Search
   

 

Querying KDE

Note: This article discusses Kentik's implementation of SQL to enable querying the Kentik Data Engine with the Query SQL Method of the V5 Query API. SQL querying of KDE directly (with a psql client) is deprecated. For additional information please contact Kentik support.

General considerations related to querying the KDE Main Table (KDE Tables), as well as Kentik-specific SQL functions, are covered in the following topics:

Notes:
- For general information about KDE and its tables, see KDE Overview.
- You must understand subquery functions and time management in order to write a KDE query that returns meaningful results.
- Rows will only be returned from a query if the SELECT statement includes one or more of the following main table columns (see Main Table Schema): in_bytes, in_pkts, out_bytes, out_pkts, both_bytes, both_pkts, tcp_retransmit.

 

 
 top

About SQL Querying

As discussed in KDE Overview, the Kentik Data Engine (KDE) datastore was developed by Kentik to store historical and real-time network traffic data. Behind the scenes, Kentik Detect uses SQL in the queries that access data in KDE. These queries are typically directly exposed to users only via the Query SQL Method of the V5 Query API. In most cases these queries use standard PostgreSQL, but the special Kentik-specific functions covered in this article are sometimes required.

 

 
 top

Time Management in Queries

All times in KDE are represented using UTC (Coordinated Universal Time, a.k.a. ITU-R TF.460-6). Additional information about times is covered in the following sections:

 

 
 top  |  section

Time Rounding

Depending on the time range covered by a query and that query’s dataseries (Full or Fast; see Resolution Overview), Kentik Detect may apply rounding to the following time values:

  • The start and end times of the specified time range.
  • The interval (granularity) used for reporting packets and bytes.

Kentik Detect time rounding is summarized in the following table, with additional explanation provided in the topics below:

Duration of query time range Dataseries Width of aggregation steps
< 3h Full 1 minute
>= 3h and < 24h Full 5 minute
>= 24h Full 10 minute
Any Fast 60 minute

 

Rounding for Fast Dataseries

The start and end time of queries run on the Fast dataseries always “snap” to the closest hour. For example, if a query’s timespan is from 1:05 PM on 5/12 to 10:31 PM on 5/15, returned results will cover 1:00 PM on 5/12 to 11:00 PM on 5/15.

 

Rounding for Full Dataseries

For queries run on the Full dataseries, the rounding of times specified in a query is a two-stage process. The first stage of time rounding involves shifting the start and end of the time range used for longer queries (greater than five minutes) to the boundaries of the logical “slices” used to store data in KDE. Slices for the Full dataseries are one minute (see Subqueries, Slices, and Shards). The following table shows how the times specified in a query of the Full dataseries are modified when the query is initially processed by Kentik Detect:

Operator applied
to specified
start | end value
Change to specified time value Example for:
- Fast dataset (slice width = 1 min)
- Time specified as 12:00:30
If query-specified time range > 5 minutes:
< Strip seconds, end before resulting slice - Last slice used in series starts at 11:59.
<= Strip seconds, end with resulting slice - Last slice used in series starts at 12:00.
>= Strip seconds, start with resulting slice - First slice in series starts at 12:00
> Strip seconds, start after resulting slice - First slice in series starts at 12:01
If query-specified time range <= 5 minutes:
< None (end before specified second) - Last slice used in series starts at 12:00;
- ctimestamp is filtered for < 12:00:30.
<= None (end with specified second) - Last slice used in series starts at 12:00;
- ctimestamp is filtered for <= 12:00:30.
>= None (start with specified second) - First slice used in series starts at 12:00;
- ctimestamp is filtered for >= 12:00:30.
> None (start after specified second) - First slice used in series starts at 12:00;
- ctimestamp is filtered for > 12:00:30.

For the second stage of rounding for the Full dataseries, Kentik Detect starts with the start/end time values output from the first stage. These values are then rounded down (for start times) or up (for end times) to the nearest aggregation step boundary, which will vary depending on the duration of the query time range that results from the first stage.

 

 
 top  |  section

Time Types

KDE supports two time types that may be utilized interchangeably to view, filter, or group by time:

  • ctimestamp
  • i_start_time

Note: If per-second granularity is required, express time using ctimestamp.

 

Using ctimestamp

The way a ctimestamp time value in a query is interpreted by the interface depends on context:

  • If the ctimestamp value is in a WHERE clause and the specified value is less than 604801, then the time will be calculated from the specified value by applying the following formula:
    time_value = now() - specified_value seconds
  • In all other situations the specified ctimestamp value will be interpreted by the interface as being in epoch timeformat (seconds since midnight, 1/1/1970), for example 1418177100.

When displayed, ctimestamp is always represented as epoch seconds.

Notes:
- ctimestamp can be displayed in timestamp format (2014-12-10 01:55:01+00) utilizing the to_timestamp() function.
- group of ctimestamp will aggregate on a per-second basis.

The following examples illustrate different ways to use ctimestamp.

Display the last 10 seconds of data:

SELECT ctimestamp,
  in_pkts
FROM all_devices
WHERE ctimestamp > 10
LIMIT 200;

Display data in specified time range:

SELECT ctimestamp,
  in_pkts
FROM all_devices
WHERE ctimestamp > 1418177100 OR ctimestamp < 1418177200
LIMIT 200

Display the epoch and timestamp for all flows in the last 10 seconds, with ctimestamp displayed in timestamp format:

SELECT ctimestamp,
  to_timestamp(ctimestamp),
  in_pkts
FROM all_devices
WHERE ctimestamp > 10
ORDER by ctimestamp
LIMIT 200;

Aggregate grouping of ctimestamp on a per-second basis:

SELECT ctimestamp,
  sum(in_pkts) AS f_sum_in_pkts
FROM all_devices
WHERE ctimestamp > 45
GROUP by ctimestamp
ORDER by ctimestamp
LIMIT 200;

 

Using i_start_time

i_start_time utilizes timestamp format (2014-12-10 01:55:01) without a timezone specified. For both filtering and display i_start_time is rounded as described in Time Rounding, except that for queries whose timerange is less than 3 minutes the display of times will be rounded to the nearest minute. The following examples illustrate different ways to use i_start_time.

This example shows conversion of i_start_time to epoch using the extract(epoch) function, with rounding applied to the display of i_start_time:

SELECT i_start_time,
  extract(epoch from i_start_time),
  ctimestamp,
  in_pkts
FROM all_devices
WHERE ctimestamp > 10
ORDER by ctimestamp
LIMIT 200;

This example shows filtering with i_start_time (e.g. where clause):

SELECT i_start_time,
  in_pkts
FROM all_devices
WHERE i_start_time > '2014-12-10 02:14:36'
  AND i_end_time < '2014-12-10 02:17:55'
ORDER by ctimestamp
LIMIT 200;

This example shows grouping functionality, which is one of the primary use cases for which the interface snaps the display of i_start_time to the earliest minute (or longer for wide queries, as noted in Time Rounding):

SELECT i_start_time,
  sum(in_pkts) AS f_sum_in_pkts
FROM all_devices
WHERE i_start_time > '2014-12-10 02:00:36'
  AND i_end_time < '2014-12-10 02:21:55'
GROUP by i_start_time
ORDER by i_start_time
LIMIT 200;


 

 
 top

Traffic Reporting

The factors that influence the reporting of packets and bytes from a main table in response to a query are covered in the following sub-topics:

 

 
 top  |  section

Device Type

The type of device affects how traffic is stored and reported:

  • Router traffic: Flow from routers is typically collected on ingress and therefore reported in the in_pkts and in_bytes fields only. It is possible, however, to configure some routers to export on egress, thereby populating the out_bytes and out_pkts fields.
    Note: Configuring a router to report on both ingress and egress will likely result in double-reporting of that router’s traffic.
  • Host traffic: Because a host (server) is generating traffic and is not routing, a host must report both in and out on an interface. Therefore the host agent used by Kentik Detect (e.g. nProbe) will report both in_pkts/in_bytes and out_pkts/out_bytes.

Note: To check the type and configuration of a device, find the device in the Device List in the Kentik Detect portal (Admin » Devices) and click on either the device’s name or the Edit icon in the device’s row.

 

 
 top  |  section

Reporting Point

Bytes/packets are counted at the following points:

  • in_bytes and in_pkts: Reported on ingress, meaning as traffic enters an interface (and exits a cross-router interface as reported in the out_port field);
  • out_bytes and out_pkts: Reported on egress, as traffic exits an interface.

 

 
 top  |  section

Reporting Interval

The interval used for reporting packets and bytes varies depending on the time range covered by the query. These intervals are used to normalize to bits/second (bps) and packets/second (pps). For details, see Time Management in Queries.

 

 
 top

TCP Flag Filtering

Kentik Detect allows users to filter query results by TCP flags that were set on the flow using a flow mask. Query results may be filtered by the following flags:

  • NS: ECN-nonce concealment protection.
  • CWR: Congestion Window Reduced.
  • ECE: ECN-Echo.
  • URG: urgent.
  • ACK: Acknowledgment.
  • PSH: Push.
  • SYN: Synchronize sequence numbers.
  • FIN: No further data.

Here’s an example of a query involving filtering by TCP flags, in this case looking for IPs in a network that have received SYN-only flows from outside of that network over the last minute:

SELECT ipv4_dst_addr,
  round(sum(in_pkts)/60) AS f_sum_in_pkts$pps,
  round((sum(in_bytes)/60)/1000)*8) AS f_sum_in_bytes$kbps
FROM all_devices
WHERE ctimestamp > 60
  AND tcp_flags = 2
  AND dst_flow_tags LIKE '%MYNETWORK%'
  AND src_flow_tags NOT LIKE '%MYNETWORK%'
GROUP by ipv4_dst_addr
ORDER by f_sum_in_pkts$pps DESC
LIMIT 20;


 

 
 top

Tag-based Queries

The use of tags in KDE queries is covered in the following topics:

 

 
 top  |  section

About Tags

The KDE query interface includes support for tags that are created on the Tagging page of the Kentik Detect portal (choose Tagging from the drop-down Admin menu, then click the Add Tag button). The tags are defined by specifying values in one or more tag fields on that page. For further information on the tag fields and how to use them, see the Knowledge Base article Tag Settings.

 

 
 top  |  section

Populating Tag Columns

To support tagging capabilities, the KDE main table for each device includes two columns: src_flow_tags and dst_flow_tags. These columns are populated as flow is ingested by Kentik Detect, with the values specified in the tag fields (Add/Edit Tags page) being compared to the corresponding fields in the flow (or, in the case of ASN and community, information derived from the values of the fields in the flow itself). This comparison occurs twice, once for source fields and once for destination fields:

  • If all of the values specified in the ANDed tag fields for a given tag are matched in SRC-related flow fields (i.e. SRC IP, SRC port, ASN path associated with SRC IP, or communities associated with SRC IP) then the tag’s name is appended as text to the existing tags (if any) in the src_flow_tags column for that flow.
  • If all of the values specified in the ANDed tag fields for a given tag are matched in DST-related flow fields (i.e. DST IP, DST port, ASN path associated with DST IP, or communities associated with DST IP) then the tag’s name is appended as text to the existing tags (if any) in the dst_flow_tags column for that flow.

Note: Because the tag fields are ANDed for each of the two comparison operations described above, a tag will be applied only when all tag fields are matched.

The tag fields on the Add/Edit Tags page allow reference to both devices and interfaces, and the application of tags will vary depending on if and how each of those fields is specified:

  • If a tag references a device (i.e. you list the device in the Device Names or IP Addresses tag field) then when Kentik Detect receives flow from that device the tag is applied to both src_flow_tags and dst_flow_tags. For example, let’s assume that you have a set of devices named LVS1, LVS2, and LVS3 and that you list those names in the Device Names tag field when defining a tag named LVS. If the received flow for a given device reports traffic whose source is LVS1, LVS2, or LVS3 then the tag LVS will be appended to both the src_flow_tags and dst_flow_tags columns of that device’s main table.
  • If a tag references an interface (i.e. you list the interface in the Interface Names or Descriptions tag field), then a tag is applied to src_flow_tags if the received flow shows traffic entering on that interface, and a tag is applied to dst_flow_tags if the received flow shows traffic leaving on that interface.

Note: Because Kentik Detect stores complete flow rather than summaries, historical queries are not limited to flow attributes that have been defined with tags in advance. However, because the tag fields for a given flow are populated as the flow is ingested into KDE, tags cannot be applied retroactively to rows that are already in the database.

 

 
 top  |  section

Tag Name Case

Tag names are forced to all-caps when tags are saved in the portal’s Add/Edit Tag pages. Thus a tag name entered as “lsv1” or “Lsv1” becomes “LSV1” and must be queried as such. When querying via the Kentik Detect portal, tag names entered in lower or mixed case are automatically converted to upper case. If you access your data via an external PostgreSQL interface, however, you must enter tag names in all-caps.

Note: Tag names are made up entirely of alphanumeric characters, underscores (“_”), and hyphens (“-”). Spaces and other special characters are not allowed.

 

 
 top  |  section

Queries Using Tags

The result of KDE’s tag preprocessing at ingest is that the src_flow_tags and dst_flow_tags columns of each device’s main table contain a delimited list of tags (all upper case) that can be searched as part of a KDE query. These searches typically involve a WHERE clause with a LIKE operator, which looks for a pattern matching the single-quoted string following LIKE. To return a valid result that string must begin and end with the SQL wildcard “%” (percent character), which enables matching within the lists of tags in the src_flow_tags and dst_flow_tags columns.

The LIKE operator means that a query will match any instance of the specified string that is contained within the specified search column (src_flow_tags or dst_flow_tags). Suppose, for example, that you’ve created a set of tags named MYTAG1, MYTAG2, and MYTAG3. A query with a WHERE clause using the string “MYTAG” would produce a match on all three of those tags. This means that you can use tag names to create groups of “like” things and match them conveniently. But it also means that, using our example, if you had another tag named simply MYTAG you would not be able to match that tag without also matching the MYTAG1, MYTAG2, and MYTAG3 tags.

 

 
 top  |  section

MYNETWORK Tag Queries

Kentik recommends that every customer create a tag called MYNETWORK that indicates traffic whose source or destination is their network. A MYNETWORK tag allows a user to quickly and easily see whether traffic came from an internal or external source and is headed toward an internal or external destination.

For example, using the following two LIKE conditions will return only flows from internal traffic:

src_flow_tag LIKE '%MYNETWORK%'
  AND dst_flow_tag LIKE '%MYNETWORK%'

Alternatively, the following two LIKE conditions will match only flow from traffic that has entered your network from the outside (excludes internal-to-internal and internal-to-external traffic):

src_flow_tag NOT LIKE '%MYNETWORK%'
  AND dst_flow_tag LIKE '%MYNETWORK%'

Note: For information on creating a MYNETWORK tag, see Add a MYNETWORK Tag.

The following examples of tag-related queries assume that you’ve created a MYNETWORK tag. This query would display MBps over the last 60 seconds for all traffic originating from internal ASNs:

SELECT ((sum(in_bytes)*8)/60)/1000000 AS f_sum_in_bytes$MBps
FROM all_devices
WHERE ctimestamp > 60
  AND src_flow_tags LIKE '%MYNETWORK%';

This query would display MBps over the last 60 seconds for all traffic that did not originate from your internal ASNs but whose destination is among your internal ASNs:

SELECT ((sum(in_bytes)*8)/60)/1000000 AS f_sum_in_bytes$MBps
FROM all_devices
WHERE ctimestamp > 60
  AND dst_flow_tags LIKE '%MYNETWORK%'
  AND src_flow_tags NOT LIKE '%MYNETWORK%';


 

 
 top

Additional Useful Queries

The following examples cover some additional useful queries.

Return kpps and kBps over the last hour, grouped by minute (the first minute is skipped as it is likely incomplete most of the time):

SELECT i_start_time,
  round(sum(in_pkts)/(3600)/1000) AS f_sum_in_pkts,
  round(sum(in_bytes)/(3600)/1000)*8 AS f_sum_in_bytes
FROM all_devices
WHERE ctimestamp > 3660
  AND ctimestamp < 60
GROUP by i_start_time
ORDER by i_start_time DESC
LIMIT 1000;

Return the top ten ASNs by source over the last 10 minutes excluding internal traffic (this assumes that you have set up an ASN tag called MYNETWORK on your internal devices; see MYNETWORK Tag Queries):

SELECT src_as,
  round(sum(in_pkts)/(600)/1000) AS f_sum_in_pkts,
  round(sum(in_bytes)/(600)/1000)*8 AS f_sum_in_bytes
FROM all_devices
WHERE ctimestamp > 660
  AND ctimestamp < 60
  AND src_flow_tags NOT LIKE '%MYNETWORK%'
GROUP by src_as
ORDER by f_sum_in_bytes DESC
LIMIT 10;

Using the IN operator in a WHERE clause, return the top 20 ASNs with pps > 1000, displayed as pps and kbps:

SELECT ipv4_dst_addr,
  round(sum(in_pkts)/60) AS f_sum_in_pkts$pps,
  round(sum(in_bytes)/(60)/1000)*8 AS f_sum_in_bytes$kbps
FROM all_devices
WHERE ctimestamp > 60
  AND tcp_flags IN (2,3,4)
  AND dst_flow_tags LIKE '%MYNETWORK%'
  AND src_flow_tags NOT LIKE '%MYNETWORK%'
GROUP by ipv4_dst_addr HAVING sum(in_pkts)/60 > 100
ORDER by f_sum_in_pkts$pps DESC
LIMIT 20;

Note: The NOT IN syntax is not currently supported.

Return the top ten 24-bit CIDR blocks (“ /24s”) over the last minute that are sourcing SYN traffic to your network, displayed as pps:

SELECT ipv4_src_addr AS f_cidr24_ipv4_src_addr,
  round(sum(in_pkts)/60) AS f_sum_in_pkts$pps,
  round(sum(in_bytes)/(60)/1000)*8 AS f_sum_in_bytes$kbps
FROM all_devices
WHERE ctimestamp > 60
  AND tcp_flags = 2
  AND dst_flow_tags LIKE '%MYNETWORK%'
GROUP by f_cidr24_ipv4_src_addr
ORDER by f_sum_in_pkts$pps DESC
LIMIT 10;

Return the top talker, determined by bytes sent, for each 12 minute slice over a one-day period, displayed in Kpkts, Kbps:

SELECT i_start_time,
  ipv4_dst_addr,
  round(sum(in_pkts)/(60*12)/1000) AS f_sum_in_pkts,
  round(sum(in_bytes)/(60*12)/1000)*8 AS f_sum_in_bytes
FROM all_devices
WHERE ctimestamp > 86400
GROUP by i_start_time,
  ipv4_dst_addr
ORDER by f_sum_in_bytes DESC
LIMIT 1000;

TCP Retransmits (host devices only):

SELECT i_start_time,
  'totals' AS totals,
  sum(tcp_retransmit) AS f_sum_tcp_retransmit,
  sum(out_pkts) AS f_sum_out_pkts
FROM prolo2_readnews_com
WHERE i_start_time >= now() - interval'1 hour'
GROUP BY i_start_time
ORDER BY i_start_time ASC


 

 
 top

Subquery Function Syntax

Queries to KDE are typically broken into subqueries, which means that in some situations special functions are required to return the intended results. Implementation of special functions in the KDE interface involves syntax that is not part of standard SQL, which changes the way that certain queries are written. Queries that use functions must be structured precisely to ensure that they are handled correctly by KDE’s middleware layer.

The following topics cover aspects of subquery syntax:

Note: For information on how subqueries are used in KDE see Subqueries, Slices, and Shards.

 

 
 top  |  section

Function structure

KDE special functions are made up of the following three parts that are joined by underscores:

  • The function prefix (always f);
  • The function name (may include underscores);
  • The column name (name of the column on which to run the function; may include underscores).

In the resulting structure, a function whose function name is sum and column name is in_bytes would look like this:

f_sum_in_bytes


 

 
 top  |  section

Passing Functions with AS

Functions are passed to the middleware layer via the AS keyword, overriding that keyword’s normal role of defining a column name for returned results. A function passed with AS looks like this:

SELECT sum(in_bytes) AS f_sum_in_bytes


 

 
 top  |  section

Alternate AS

When AS is used to pass a function, the normal role of the AS keyword is handled by terminating the AS string with a $ (dollar sign character) and then appending the string that would otherwise have immediately followed AS (which will define the column name in the results). For example, an “alternate AS” specifying that the returned column should be labeled as “MBps” (MegaBytes per second) would look like this:

SELECT sum(in_bytes) AS f_sum_in_bytes$MBps

One consequence of enabling this alternate approach to AS functionality is that it must be used in any SELECT statement where the column being selected is the same as the group-by column specified in a GROUP clause in the same query. In the examples below, for instance, the column ipv4_dst_addr is both the selected column and the group-by column. In the first example, the desired column name for the results is correctly specified using the alternate AS approach ($_IP_ADDR). The second example, in which the desired column name for the results is specified directly in the AS statement, results in an error:

-- GROUP is same as SELECT, so the name of the returned column is specified with the alternate AS method:
SELECT ipv4_dst_addr AS ipv4_dst_addr$_IP_ADDR,
  round(sum(IN_pkts)/60) AS f_sum_IN_pkts$_pps_
FROM ALL_devices
WHERE protocol = 17
  AND ctimestamp > 60
GROUP by ipv4_dst_addr HAVING sum(IN_pkts)/60 > 1000
LIMIT 100;
-- Using the standard AS method will result in an error:
SELECT ipv4_dst_addr AS IP_ADDR,
  round(sum(IN_pkts)/60) AS f_sum_IN_pkts$_pps_
FROM ALL_devices
WHERE protocol = 17
  AND ctimestamp > 60
GROUP by ipv4_dst_addr HAVING sum(IN_pkts)/60 > 1000
LIMIT 100;


 

 
 top

General Subquery Considerations

In addition to specific syntax related to subquery functions, additional subquery-related considerations apply when querying the KDE:

 

 
 top  |  section

Subquery Result Limiting

As described in Subquery Function Syntax, KDE queries are typically broken into subqueries, each of which queries an individual KDE main table. Each main table is populated with the flow data received from a given device (router or host) for a time-slice. For tables in the Full dataseries (see KDE Resolution) that time-slice is one minute; for Fast dataseries tables it’s one hour.

KDE’s middleware supports limiting the number of results returned by each subquery from its corresponding main table. Subquery result limiting is accomplished by the inclusion in queries of an i_sub_limit statement in the WHERE clause.

The primary use case for subquery limiting involves querying for top IP addresses across a time range. Limiting the rows returned from each subquery allows you to return the top IP addresses at each time point and thus accomplish with a single main query what would otherwise involve two queries. For example, if querying for top IPs with i_sub_limit = 2, then for every main table queried by a subquery only the top two IPs are contributed to the main query results.

Here’s an example of a query for top-10 ASNs in which i_sub_limit is used to restrict the number of results from each subquery to seven.

SELECT i_start_time,
  dst_as,
  f_sum_both_bytes AS sum_both,
  i_duration
FROM (SELECT i_start_time,
  dst_as,
  f_sum_both_bytes,
  i_duration,
  row_number() OVER (partition BY i_start_time order by f_sum_both_bytes DESC)
FROM (SELECT i_start_time,
  dst_as,
  sum(both_bytes) AS f_sum_both_bytes,
  max(i_duration) AS i_duration
FROM all_devices
WHERE (dst_flow_tags LIKE '%tag_name%' OR src_flow_tags LIKE '%tag_name%')
  AND i_device_type = 'router'
  AND i_start_time >= '2015-07-23 13:01:53'
  AND i_end_time < '2015-07-23 14:01:53'
  AND i_sub_limit=7
  AND i_fast_dataset=FALSE
GROUP BY i_start_time,
  dst_as
ORDER BY i_start_time,
  f_sum_both_bytes DESC) tmp) tmp2 where row_number <= 7;


 

 
 top  |  section

Querying on Strings

One additional general consideration related to Kentik’s use of subqueries involves how strings are included in a query. Specifically, when matching for an empty string (i.e. as_path = ‘‘“), you must use the = or <> operator. LIKE and NOT LIKE do not work for matching against nulls.

 

 
 top

Aggregation-related Subqueries

Special considerations related to working with KDE functions for aggregation and grouping are covered in the following topics:

 

 
 top  |  section

Supported Aggregation Functions

The following functions are supported for aggregation in KDE:

Aggregation level Supported functions
Subquery SUM, COUNT, MIN, MAX, COUNTDISTINCT, MEDIAN
Final All standard PostgreSQL aggregation functions, including:
SUM, MIN, MAX, COUNT

Note: Averaging should be performed by applying a mathematical formula (e.g. sum(in_pkts)/60 for packets per minute) rather than by using SQL AVG functions.

 

 
 top  |  section

Aggregation Pairing

Subquery aggregation functions are typically equally paired with SQL aggregation. That means that the column used for sub-query aggregation must match the column used for SQL aggregation.

This example shows correct pairing, with matching column name (in_bytes):

SELECT sum(in_bytes) AS f_sum_in_bytes

This example shows incorrect pairing, because the function column (in_bytes) is not matched to the SQL column (in_pkts):

SELECT sum(in_pkts) AS f_sum_in_bytes


 

 
 top  |  section

Functions in Multiple Statements

When a subquery aggregation function is called on given column in one SELECT statement, the same function is automatically applied to subsequent SELECT statements on the same column in that query without the function being called explicitly each time. This allows multiple SELECT statements performing the same aggregation on the same column to be combined into a single statement.

In the first example below, the function (f_sum_in_bytes) called on the in_bytes column in the first SELECT statement will be applied as well to the sum operation in the second SELECT statement. In the second example the unit of display is specified in the second statement as mbps. In the third example the two statements are combined into one:

-- Function in first SELECT statement will be applied to second:
SELECT sum(IN_bytes) AS f_sum_IN_bytes,
  ((sum(IN_bytes)*8)/60)/1000000
FROM ALL_devices
WHERE ctimestamp > 60;
-- Second SELECT statement specifies name of returned column as "mpbs":
SELECT sum(IN_bytes) AS f_sum_IN_bytes,
  ((sum(IN_bytes)*8)/60)/1000000 AS mbps
FROM ALL_devices
WHERE ctimestamp > 60;
-- SELECT statements are combined into one:
SELECT ((sum(IN_bytes)*8)/60)/1000000 AS f_sum_IN_bytes$mbps
FROM ALL_devices
WHERE ctimestamp > 60;


 

 
 top  |  section

Aggregation Sample Correction

Devices reporting flow data (e.g. NetFlow) to Kentik Detect are typically configured to sample the underlying flow at a given sample rate (e.g. 1 in 1024 flows or 1 in 1024 packets). When an aggregation function is run on data stored in KDE the returned bytes and packets are corrected for this sample rate. When data from the same source is not aggregated and is returned instead on a per-row basis, that data is not sample-corrected.

The results returned from the SUM aggregation function in this example will be corrected for the sampling rates of the summed data:

SELECT sum(in_pkts) AS f_sum_in_pkts
FROM all_devices
WHERE ctimestamp > 10;

The results returned from the query in this example, which doesn’t use an aggregation function, will not be corrected for the sampling rates of the data in the selected columns:

SELECT ipv4_src_addr,
  ipv4_dst_addr,
  in_pkts
FROM all_devices
WHERE ctimestamp > 10
LIMIT 10;


 

 
 top  |  section

Using the HAVING Clause

The HAVING clause allows you to filter the output based on the result of an aggregation function. For example, the following will display all IPs that exceeded 5000 pps UDP over the last 60 seconds:

SELECT ipv4_dst_addr,
  round(sum(in_pkts)/60) AS f_sum_in_pkts$_pps_
FROM all_devices
WHERE protocol = 17
  AND ctimestamp > 60
GROUP by ipv4_dst_addr HAVING sum(in_pkts)/60 > 5000
LIMIT 100;


 

 
 top

CIDR-related Subqueries

The KDE interface middleware includes special functions that enable queries to incorporate network CIDR (Classless Inter-Domain Routing) blocks, which are blocks of addresses grouped into a single routing table entry. These functions allow queries to use CIDR notation to specify the IP address and associated routing prefix of a block. The use of CIDR in KDE is covered in the following topics:

 

 
 top  |  section

CIDR Notation

CIDR notation is made up of the IP address (IPv4 or IPv6), followed by a slash (“/”) separator and then a LEN value. LEN represents a decimal number indicating the prefix length, which is the number of shared initial bits, counting from the most-significant bit of the address.

In a CIDR block representing addresses in the range from 198.204.100.0 to 198.204.100.255, for example, the first 24 bits (three 8-bit values: 198, 204, and 100, respectively) of the mask are shared (LEN = 24). That allows this range of addresses be represented in CIDR notation as the block at 198.204.100.0/24.

 

 
 top  |  section

Filter by CIDR

When processing a query involving foreign data (SQL/MED), the expectation at the top (SQL) level of the KDE interface is that the processing of any WHERE clause will be handled by the database that is the source of the foreign data. This is what allows the use of CIDR in KDE — where the CIDR notation function is handled by the middleware — even though CIDR is not supported in SQL itself and the CIDR notation would otherwise be interpreted as a literal string.

KDE’s middleware support for CIDR allows the use of a standard WHERE clause to filter by a specified CIDR block. The following query, for example, would return a SUM aggregation (in a column labeled MBps) of all traffic in the last 60 seconds sourced from all addresses within the specified CIDR block, whose mask begins with 198.204 (16 shared bits):

SELECT ((sum(in_bytes)*8)/60)/1000000 AS f_sum_in_bytes$MBps
FROM all_devices
WHERE ctimestamp > 60
  AND ipv4_src_addr = '198.204.0.0/16';


 

 
 top  |  section

Group by CIDR

The KDE interface allows traffic to be grouped into “chunks” of CIDRs using functions constructed with “cidr” plus LEN, e.g. “cidr24.” The function is used in a GROUP clause to group the data, and again in an AS clause to display the grouped data.

In the following example, data on all traffic is grouped into blocks of destination IPs that share the most significant 24 bits. The traffic for each block in the last 60 seconds is displayed in kbps:

SELECT ipv4_dst_addr AS f_cidr24_ipv4_dst_addr,
  round(((sum(in_bytes)*8)/60)/1000) AS f_sum_in_bytes
FROM all_devices
WHERE ctimestamp > 60
GROUP by f_cidr24_ipv4_dst_addr
ORDER by ipv4_dst_addr
LIMIT 200;


 

 
 top  |  section

CIDR-Filtered Grouping

As noted in Filter by CIDR, CIDR notation is interpreted in SQL as a literal string rather than as a variable representing a range of addresses. Seeing what appears to be only a single address in the WHERE clause, the SQL engine won’t execute the GROUP clause because it concludes that all results would be in the same group. To work around this special case, the KDE interface supports the use of ˜˜ (double-tilde) as a special operator to be used in place of = in a WHERE clause. If the object of a WHERE clause is a string in CIDR notation then using the ˜˜ operator tells the KDE middleware to force execution of the GROUP clause.

The following example shows a query in which the double-tilde will force execution of a GROUP clause that would otherwise be skipped because the CIDR-notated address range in the WHERE clause would be interpreted by the SQL engine as a single address:

SELECT ipv4_dst_addr AS f_cidr24_ipv4_dst_addr,
  round(((sum(in_bytes)*8)/60)/1000) AS f_sum_in_bytes
FROM all_devices
WHERE ctimestamp > 60
  AND ipv4_dst_addr ˜˜ '8.0.0.0/8'
GROUP by f_cidr24_ipv4_dst_addr
ORDER by ipv4_dst_addr
LIMIT 200;

 

In this article: