KDE Overview

This article covers querying the Kentik Data Engine (KDE), a datastore developed by Kentik to store current and historical flow data.

Notes:
- For information about the KDE main tables, including the columns from which Kentik Detect derives filters and group-by dimensions, see KDE Tables.
- For information about querying the Kentik Data Engine, see Querying KDE.

 

About Kentik Data Engine

Kentik Detect utilizes a custom-built, scalable, columnar datastore called Kentik Data Engine. KDE ingests the flow data received from customers (e.g. NetFlow or sFlow), correlates it with additional data such as GeoIP and BGP, and stores it (redundantly) in flow records that can be queried from the Kentik Detect portal, via the Kentik APIs (see Query API), or via a fully ANSI SQL-compliant PostgreSQL interface.

KDE keeps a discrete set of databases for the flow records of each customer. These databases are made up of “Main Tables” for the flow records and associated data received for each device (router or host), as well as for supplemental data that is derived from the flow data. The columns of those Main Tables contain the data that is queried by Kentik Detect. Most of these columns are represented as dimensions used for filtering and group-by in queries. For a categorized list of these dimensions, see Dimensions Reference.

 

Subqueries, Slices, and Shards

Note: This topic discusses Kentik's implementation of SQL as it relates to Kentik Detect's querying of the Kentik Data Engine and also to user querying of KDE via the Query SQL Method of the V5 Query API. Direct user querying of KDE using SQL (e.g. with a psql client) is deprecated. For additional information please contact Kentik support.

Kentik Data Engine presents to the user as a traditional (non-clustered) SQL database while using a scalable distributed cluster architecture under the hood. This architecture is built around key concepts that include subqueries, slices, shards, masters, and workers.

As a given customer’s flow records and associated data are ingested into KDE, the data from each device is stored in two parallel “main tables” (see KDE Tables), one for a full resolution dataseries and another for a “fast” dataseries, which is optimized for faster execution of queries covering long timespans (see Resolution Overview).

KDE main tables are each divided into logical “slices” of one minute (for Full dataseries) or one hour (for Fast dataseries). Every slice from a given device's main table corresponds to at least two identical "shards," each of which is an on-disk physical block in KDE. For high availability, the shards for a given slice are always stored on different machines (typically in different racks). The machines that store the shards are referred to as "workers."

The KDE subquery process is handled by both workers and "masters." For a given worker to handle a given subquery, a shard corresponding to the time-slice covered by that subquery must reside on that worker. Masters are responsible for splitting a query into subqueries that each correspond to a slice, for identifying (via metadata lookup) which workers have access to the shards needed for a given subquery, and for assigning each subquery to a worker. Each worker generates a result set from data in a shard and returns it to the issuing master, which reassembles the partial results into final results that are returned to the top-level PostgreSQL interface as SQL/MED foreign data (see https://en.wikipedia.org/wiki/SQL/MED). This approach is part of what enables Data Engine to achieve exceptional performance with a SQL interface even when querying massive amounts of data.
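The master's decomposition described above can be sketched as follows. This is an illustrative model only, not Kentik code: it splits a query timespan into one-minute subqueries (matching the Full-dataseries slice width) and reassembles per-slice partial results, here modeled as simple counts.

```python
from datetime import datetime, timedelta

def split_into_subqueries(start, end, slice_minutes=1):
    """Split a query timespan into per-slice subqueries, as a KDE master
    does before assigning each subquery to a worker. Sketch only."""
    subqueries = []
    cursor = start
    while cursor < end:
        subqueries.append((cursor, min(cursor + timedelta(minutes=slice_minutes), end)))
        cursor += timedelta(minutes=slice_minutes)
    return subqueries

def merge_results(partials):
    """Reassemble per-slice partial results (here, simple counts) into a
    final result, as the issuing master does."""
    return sum(partials)

subs = split_into_subqueries(datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 0, 5))
print(len(subs))  # 5 one-minute subqueries
```

In the real system each subquery is dispatched to a worker holding a matching shard; here the fan-out/fan-in shape is all that is modeled.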

The fact that queries to Data Engine are broken into subqueries means that some special functions are required to interact properly with the underlying architecture. These functions can affect the way that queries are written.

Note: Kentik Data Engine does not currently support use of NOT IN in a WHERE clause.
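Because NOT IN is unsupported, an equivalent WHERE-clause predicate can be expressed as AND-ed inequalities (by De Morgan's laws, `x NOT IN (a, b)` is `x <> a AND x <> b`). The helper below is a hypothetical sketch for generating such a fragment; its quoting is simplistic (numeric values only) and it is not a Kentik-provided function.

```python
def not_in_predicate(column, values):
    """Build a WHERE-clause fragment equivalent to `column NOT IN (values)`
    using AND-ed inequalities, since KDE does not support NOT IN.
    Illustrative helper; handles numeric values only."""
    return " AND ".join(f"{column} <> {v}" for v in values)

print(not_in_predicate("dst_as", [15169, 8075]))
# dst_as <> 15169 AND dst_as <> 8075
```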

 

KDE Resolution

At ingest, Kentik creates two fully independent dataseries that are stored in the KDE, one at full resolution and another optimized for faster execution of queries that cover long timespans:

  • Full dataseries: Includes every flow record sent by a given customer to Kentik (within applicable limits of the governing service agreement).
  • Fast dataseries: Includes only a subset of the flow records, enabling faster response to queries spanning 24 hours or more.

The use of parallel dataseries makes it feasible to run long-timespan queries that return in seconds without compromising the detail of shorter timespan queries. For further information about how the Fast dataseries is derived, how the query timespan determines the default dataseries for a query, and how to override the default, see Resolution Overview.
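The default choice between the two dataseries can be sketched as a simple timespan cutoff. The 24-hour threshold below is an assumption drawn from the description of the Fast dataseries above; the actual selection logic is described in Resolution Overview.

```python
from datetime import timedelta

# Assumed cutoff, per the note that Fast targets queries spanning 24 hours or more.
FAST_THRESHOLD = timedelta(hours=24)

def default_dataseries(query_span):
    """Pick the default dataseries for a query timespan: Full resolution
    for short queries, Fast for long ones. Sketch only."""
    return "Fast" if query_span >= FAST_THRESHOLD else "Full"

print(default_dataseries(timedelta(hours=2)))  # Full
print(default_dataseries(timedelta(days=7)))   # Fast
```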

 

Table Time-slicing

Each Main Table and All Devices table in KDE is subdivided into time-slices that allow data to be spread across KDE for distributed processing of subqueries (see Subqueries, Slices, and Shards):

  • For tables in the Full dataseries (see KDE Resolution) the ingestion of flow data into a main table results in one row for each flow record, and the duration of the time-slice is one, five, or ten minutes depending on the width of the query; see Time Rounding.
  • For Fast dataseries tables the time-slice is one hour (for information about how Fast data is derived, see About Dataseries Resolution).
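The time-slicing above can be sketched as rounding timestamps down to slice boundaries. The width-to-slice thresholds in `slice_width_minutes` are illustrative assumptions, not Kentik's actual cutoffs (which are defined under Time Rounding).

```python
from datetime import datetime, timedelta

def slice_width_minutes(query_span):
    """Return a Full-dataseries time-slice width (1, 5, or 10 minutes)
    for a query of the given width. Thresholds here are hypothetical."""
    if query_span <= timedelta(hours=3):
        return 1
    if query_span <= timedelta(hours=12):
        return 5
    return 10

def round_down(ts, minutes):
    """Round a timestamp down to the start of its time-slice."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

print(round_down(datetime(2020, 1, 1, 10, 7, 30), 5))  # 2020-01-01 10:05:00
```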
 

KDE Query Efficiency

KDE stores a comprehensive range of network data and supports multiple ways for you to find the answers you need. In terms of performance, some approaches to querying KDE are more efficient than others. The following tips will help ensure that your queries are processed efficiently, enabling results to return as quickly as possible:

  • Use the device selector: Wherever possible, use the Selected Devices Dialog to choose the devices (whether individually, by site, or by label) from which you want to return results. This approach is much more efficient than choosing All Devices and using filters to narrow the results to specific devices. Only use All Devices when you really mean “all devices.”
  • Favor native dimensions: Some of the dimensions used for group-by and filtering are stored directly in KDE (native); others are derived at query time (virtual) from other KDE-stored information. Native dimensions are more efficient for KDE to process than virtual dimensions, so the more your queries are based on native dimensions, the faster they will run. You can check whether a given dimension is native or virtual by consulting the KB topic Dimensions Reference.
  • Apply tags for recurring use cases: KDE columns for Flow Tags and Custom Dimensions are populated at ingest and are thus native. When you build a query that you expect to run repeatedly (e.g. for a dashboard, saved view, or alert policy) it’s more efficient to use tags or custom dimensions to narrow results than to use filters (which are applied at time-of-query).
 

Populating BGP Fields

When a flow is ingested into a device's KDE main table, the way the flow record's BGP fields are populated depends on the following three conditions at time-of-ingest:

  1. Is the device enabled for peering with Kentik Detect?
  2. Does the BGP routing table obtained by peering with the device include an IP route for the received flow?
  3. Is the AS path (list of ASNs) in the IP route empty or non-empty?

The following table shows how the conditions above determine the values written to the BGP-related fields in a given device's main Data Engine table:

Conditions:
  • State A: peering enabled; the routing table includes an IP route for the flow; the route's AS path is a non-empty string.
  • State B: peering enabled; the routing table includes an IP route for the flow; the route's AS path is an empty string.
  • State C: peering enabled; no IP route for the flow (AS path N.A.).
  • State D: peering not enabled (IP route and AS path N.A.).

Fields (value written in each state):
  • src_bgp_aspath, dst_bgp_aspath — A: the BGP AS path in the route. B: the ASN of the peering device. C: empty string. D: empty string.
  • src_as, dst_as — A: the last ASN in the BGP AS path. B: the ASN of the peering device. C: the ASN of the corresponding (dst or src) IP address. D: the ASN of the corresponding (dst or src) IP address.
  • i_src_as_name, i_dst_as_name — A: the AS name corresponding to the last ASN in the BGP AS path. B: the AS name corresponding to the ASN of the peering device. C: the AS name corresponding to the (dst or src) ASN. D: the AS name corresponding to the (dst or src) ASN.
  • src_nexthop_as, dst_nexthop_as — A: the first ASN in the BGP AS path. B: the ASN of the peering device. C: 0. D: 0.
  • i_src_nexthop_as_name, i_dst_nexthop_as_name — A: the name corresponding to the first ASN in the BGP AS path. B: the AS name corresponding to the ASN of the peering device. C: empty string. D: empty string.
  • src_second_asn, dst_second_asn — A: the ASN in the second position of the BGP AS path. B: 0. C: 0. D: 0.
  • i_src_second_asn_name, i_dst_second_asn_name — A: the name corresponding to the ASN in the second position of the BGP AS path. B: empty string. C: empty string. D: empty string.
  • src_third_asn, dst_third_asn — A: the ASN in the third position of the BGP AS path. B: 0. C: 0. D: 0.
  • i_src_third_asn_name, i_dst_third_asn_name — A: the name corresponding to the ASN in the third position of the BGP AS path. B: empty string. C: empty string. D: empty string.
  • src_bgp_community, dst_bgp_community — A: the BGP community of the route. B: the BGP community of the route. C: empty string. D: empty string.
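The state logic for two representative fields (src_bgp_aspath and src_as) can be sketched as below. This is an illustrative model of the ingest-time decision, not Kentik code; the function name and parameters are hypothetical. `route` is None when no BGP route covers the flow, and otherwise holds the route's AS-path string (possibly empty).

```python
def bgp_aspath_and_asn(peering_enabled, route, ip_asn, peer_asn):
    """Return (src_bgp_aspath, src_as) per the state table:
    State A: non-empty AS path; State B: route with empty AS path;
    State C: no route; State D: peering not enabled. Sketch only."""
    if not peering_enabled or route is None:
        # States C and D: AS path is empty; the ASN falls back to the
        # ASN of the corresponding IP address.
        return "", ip_asn
    if route == "":
        # State B: route exists but its AS path is empty; both fields
        # take the ASN of the peering device.
        return str(peer_asn), peer_asn
    # State A: src_bgp_aspath is the path; src_as is its last ASN.
    return route, int(route.split()[-1])

print(bgp_aspath_and_asn(True, "3356 1299 15169", ip_asn=64500, peer_asn=64496))
# ('3356 1299 15169', 15169)
```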

 

KDE Access via PostgreSQL

Note: psql access to KDE is now deprecated. For additional information please contact Kentik support.

In addition to accessing your KDE-stored data via the Kentik Detect portal or the Kentik APIs (see Query API), your organization may still be able to connect to KDE via an ANSI SQL-compliant PostgreSQL client (a deprecated feature; see the note above). To do so you'll need the information shown below, which is unique to each organization but shared by all users within it. The information is available from the Kentik Data Engine (KDE) page of your User Profile (choose My Profile from the drop-down menu at the far right of the Kentik Detect navbar):

  • KDE host:port: The domain name of the Kentik Detect host used for your organization, and the Kentik-assigned port through which users registered to your organization access the KDE datastore.
  • KDE user: A Kentik-assigned unique identifier for your organization.

Using the above information, build a command line with the following pattern:

psql -h {host} -p {port} -U {user} -W

When you execute the command, a prompt will request a password, which is actually the name of your organization's database in KDE. If this value has been set, you can find it on that same User Profile page: it is the string following (but not including) the -W in the field labeled "PSQL command line."

Note: The prompt response is not the same as the string in the field labeled "KDE password."
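The documented command pattern can be assembled programmatically, for example when scripting connections for multiple users. The host, port, and user values below are placeholders; the real values come from the KDE page of your User Profile.

```python
def psql_command(host, port, user):
    """Assemble the psql invocation for KDE access in the documented
    pattern: psql -h {host} -p {port} -U {user} -W."""
    return ["psql", "-h", host, "-p", str(port), "-U", user, "-W"]

# Placeholder values for illustration only.
print(" ".join(psql_command("example.kentik.com", 5433, "org_12345")))
# psql -h example.kentik.com -p 5433 -U org_12345 -W
```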

© 2014- Kentik