Note: Capacity Planning is currently a beta feature. Please report any issues to firstname.lastname@example.org.
This how-to shows how to use Kentik Detect to assess capacity utilization across the interfaces of your network infrastructure.
Capacity planning for a network involves figuring out when links, routers, switches, firewalls, and other network infrastructure are approaching the limits of their capacity. In many cases, capacity planning has traditionally been a completely manual process with complex spreadsheets to pull in the data, aggregate it, run statistics on it, and trend it over time.
Typical capacity planning tools are a step up from a totally manual approach. They collect counter data and let you set an alert on a static threshold, but they are restricted to an interface-level view, which leaves out important information including:
- Business insights that service providers need:
  - Total capacity to a given external network
  - Total capacity in or out of a given market
  - The type of connectivity, e.g. transit, backbone, paid peering, or free peering.
- Operational insights needed by large enterprises:
  - WAN usage
  - ISP uplink capacity
  - East-West datacenter hotspots
  - Adequacy of inter-datacenter links.
In Kentik Detect, Capacity Planning goes beyond legacy tools to provide integrated insights into network capacity, utilization, performance, and traffic composition. The Capacity Planning page (Analytics » Capacity) provides users with a quick view of the link utilization across the network — filtered to expose the most urgent issues — and projects the date on which, if current trends hold, each interface will reach a user-specified percent of capacity. These capabilities enable service providers and other network operators to minimize costs while delivering the best possible service.
To illustrate the use of Capacity Planning, we’ll start by creating an example report with the settings in the sidebar as shown below.
Options pane settings:
- In Dimensions: Source Interface (default)
- Out Dimensions: Destination Interface (default)
- Show Device-level Data: On, so the output will show not only the interface name but also the name of the device.
- Show Trend Analytics: On, so we can track trends in the data and enable projection of a “runout” date.
- Set Target Runout: 95% (default), meaning that a capacity is defined as having run out when utilization hits 95%.
- Baseline interval: Month (default), so the trends will be expressed by month (i.e. MoM for Month over Month).
- Max Runout Date: 2017-12-31, so that runouts and warnings will be shown based only on trends through the end of 2017 (this example is from Fall 2017).
Thresholds pane settings:
- Severity Thresholds: Set the percent of capacity for the following:
  - Warning (lower threshold): 70%
  - Critical (higher threshold): 90%
- Display Threshold (minimum utilization to appear on the report): 20%.
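To make the effect of these three thresholds concrete, here is a minimal Python sketch of how a utilization value could be bucketed. The function name `classify` and the inclusive comparisons at each boundary are assumptions made for illustration; Kentik Detect's exact boundary handling is not described here.

```python
from typing import Optional

WARNING_PCT = 70.0   # Warning (lower threshold) from the Thresholds pane
CRITICAL_PCT = 90.0  # Critical (higher threshold) from the Thresholds pane
DISPLAY_PCT = 20.0   # Display Threshold: minimum utilization to appear on the report


def classify(utilization_pct: float) -> Optional[str]:
    """Bucket a utilization percentage; None means 'not shown on the report'.

    Boundaries are treated as inclusive, which is an assumption.
    """
    if utilization_pct < DISPLAY_PCT:
        return None
    if utilization_pct >= CRITICAL_PCT:
        return "critical"
    if utilization_pct >= WARNING_PCT:
        return "warning"
    return "ok"
```

With these settings, an interface at 15% would be left off the report, while one at 40% would appear without a warning or critical flag.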
For Time, Devices, and Filtering panes, leave all settings as defaults:
- Time: Lookback window of one week.
- Devices: All devices.
- Filtering: No filters.
When we’re done with our settings, we run a report by clicking the blue Run Report button at the top of the sidebar. With the Group-by selector at upper right set to Sites, the report is sorted so that all interfaces from the devices assigned to a given site (i.e. PoP or router group) are grouped together in the Capacity List (the table of results). Looking at the columns of the table, we can see that utilization is measured three ways: Average, 98th Percentile, and Max. Based on the thresholds that we set, we can quickly see which links currently have warning and critical utilization levels in each of the three categories.
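As a rough illustration of the three utilization measures, the following Python sketch computes average, 98th percentile, and max from a list of traffic samples for one interface. The function name `utilization_summary` is hypothetical, and the nearest-rank percentile method is an assumption; the method Kentik Detect actually uses may differ.

```python
from typing import Dict, Sequence


def utilization_summary(samples_bps: Sequence[float], capacity_bps: float) -> Dict[str, float]:
    """Return average, 98th-percentile, and max utilization as percent of capacity."""
    if not samples_bps:
        raise ValueError("at least one sample is required")
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")

    ordered = sorted(samples_bps)
    n = len(ordered)
    # Nearest-rank percentile: 1-based rank = ceil(0.98 * n), in integer arithmetic.
    rank = (98 * n + 99) // 100
    p98_bps = ordered[rank - 1]
    avg_bps = sum(ordered) / n

    def to_pct(bps: float) -> float:
        return 100.0 * bps / capacity_bps

    return {
        "average": to_pct(avg_bps),
        "p98": to_pct(p98_bps),
        "max": to_pct(ordered[-1]),
    }
```

The 98th percentile discards the top 2% of samples, so it reflects sustained peaks while ignoring brief spikes that would dominate the Max column.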
For a look at results from one interface, we’ll focus on the second row (outlined in image below) in the section for the Ashburn, VA site, where there’s a device called pe1.iad. This device has an interface called ae7 (an aggregated Ethernet bundle), which is a backbone link with a capacity of 30 Gbps (automatically discovered by Kentik via SNMP).
In the close-up below of this part of the ae7 row, we can see how the traffic on this link is projected to grow over time. Each of the cells in these columns contains three lines of information. The first line shows that average traffic volume appears to be OK (40% of capacity) but traffic peaks, both 98th percentile and Max, are above our warning threshold. In the second line, we see the MoM (month-over-month) trend, which shows growth rates of 11% and 10% respectively for 98th percentile and Max. In the third line, meanwhile, we see the dates on which growth at these projected rates would result in traffic that exceeds the runout target (95% of capacity) that we set above.
The estimation of runout dates provides advance warning of when we need to either deploy an upgrade or shift traffic. In this case, the 98th percentile shows a runout date of December 11, and the maximum shows December 6. An action plan is definitely needed by the beginning of December to avoid congestion on this interface. Without these prioritized, automatic projections for every interface in the network, capacity planners would need very complex spreadsheets to arrive at these runout dates.
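The arithmetic behind a runout projection can be sketched in a few lines of Python. This is not Kentik Detect's algorithm, which is not documented here; it assumes simple compound month-over-month growth and an average month length of 30.44 days, and the function name `project_runout` and its parameters are hypothetical.

```python
import math
from datetime import date, timedelta
from typing import Optional

DAYS_PER_MONTH = 30.44  # average month length; an assumption for this sketch


def project_runout(
    current_pct: float,
    mom_growth: float,
    target_pct: float = 95.0,
    as_of: Optional[date] = None,
    max_runout: Optional[date] = None,
) -> Optional[date]:
    """Estimate when utilization reaches target_pct under compound MoM growth.

    Solves current_pct * (1 + mom_growth) ** months == target_pct for months.
    Returns None if there is no growth or the date falls after max_runout.
    """
    as_of = as_of or date.today()
    if current_pct >= target_pct:
        return as_of  # already at or past the runout target
    if current_pct <= 0 or mom_growth <= 0:
        return None  # flat or shrinking traffic never reaches the target
    months = math.log(target_pct / current_pct) / math.log1p(mom_growth)
    runout = as_of + timedelta(days=round(months * DAYS_PER_MONTH))
    if max_runout is not None and runout > max_runout:
        return None  # beyond the Max Runout Date, so not reported
    return runout
```

For example, `project_runout(80.0, 0.11, as_of=date(2017, 10, 1), max_runout=date(2017, 12, 31))` would project an interface at a hypothetical 80% utilization growing 11% per month; the 80% starting value is illustrative and is not taken from the ae7 row.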
The ability to run a report manually is great, but we can be more proactive by incorporating capacity analytics into a Kentik Detect alert policy, which will give us an automatic notification when an interface is nearing its capacity. To make this capability easy to implement you’ll find two ready-made policies in our Alerting system’s Policy Library (Alerting » Library):
- Inbound Interface near capacity
- Outbound Interface near capacity
To use either preset, click on the Copy button at the right of the policy’s row in the list, which will add an editable version of the preset to your organization’s policies (Alerting » Policies). By default, these policies are set to trigger an alarm if any interface has at least 700 Mbps of traffic and is at more than 85% utilization. To tailor these settings to your infrastructure, click on the alert in the Policy List. In the resulting Edit Alert Policy dialog, go to the Alert Thresholds tab and adjust the settings in the Conditions pane.
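The default condition combines a traffic floor with a utilization ceiling, which keeps lightly loaded low-capacity interfaces from triggering alarms. A minimal Python sketch of that logic follows; the function `near_capacity` and its parameter names are hypothetical and are not part of Kentik Detect's policy configuration.

```python
def near_capacity(
    traffic_mbps: float,
    capacity_mbps: float,
    min_traffic_mbps: float = 700.0,
    utilization_threshold_pct: float = 85.0,
) -> bool:
    """Return True when both default policy conditions hold.

    Traffic must be at least min_traffic_mbps AND utilization must be
    strictly greater than utilization_threshold_pct.
    """
    if capacity_mbps <= 0:
        return False  # unknown or invalid capacity: utilization is undefined
    utilization_pct = 100.0 * traffic_mbps / capacity_mbps
    return traffic_mbps >= min_traffic_mbps and utilization_pct > utilization_threshold_pct
```

Under these defaults, a 1 Gbps interface carrying 900 Mbps would trigger, while a 10 Gbps interface carrying 8 Gbps (80% utilization) would not.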
At the bottom of the Alert Thresholds tab you’ll find the Notifications pane, where you can configure how you want to be notified; options include email, Slack, PagerDuty, Syslog message, or JSON Post to a Web page. This gives you the ability to tie the alerts to an existing capacity management system if you already have one in place.
You’ll also want to set which dashboard opens when you click the Open in Dashboard button (in the Actions column of the list on the Active or History page) for any alarm generated by this policy. Still in the dialog, go to the General Settings tab, click on the Policy Dashboard field, and choose the “Capacity Management” dashboard, which is one of Kentik’s dashboard presets.
Once you have an alert set up and enabled, you’ll receive notifications when an interface nears capacity. At that point you’ll probably want to investigate top talkers on that link to understand why bandwidth consumption has jumped:
- For enterprises, you might want to see if there is a misconfiguration or misuse.
- For service providers, you might want to see if a single customer or CDN is using the majority of the capacity on the link.
Alert policies generate notifications when conditions match those specified in policy thresholds, at which point the alert enters ALARM state. Alarms are listed in the Alerts List on the Active page (Alerting » Active), with each alarm shown as a single row. When you click the Open in Explorer button at the right of a given row, Data Explorer will open in a new browser window or tab, with the sidebar set to correspond to the values of the alarm’s key (for example, if the key’s dimension is Destination IP and the value of the key in the alarm is 220.127.116.11 then there will be a filter in the Data Explorer sidebar for inet_dst_addr ILIKE 220.127.116.11). The display area will show a graph (like the image below) of the results of the query run with those sidebar settings.
To dive deeper into the traffic behind the utilization bump, use the Group By Dimensions field in the Query pane of the Explorer sidebar to add dimensions corresponding to additional traffic details. A good place to start is by adding Source IP/CIDR, Destination IP/CIDR, and Destination Protocol IP Port to see your top IP talkers on the link.
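Conceptually, grouping by these dimensions means summing traffic per unique combination of values and ranking the results. The Python sketch below shows that aggregation over a list of flow records; the record layout and field names (`src_ip`, `dst_ip`, `dst_port`, `bytes`) are hypothetical and do not reflect Kentik Detect's schema or query API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple, Union

Flow = Dict[str, Union[str, int]]


def top_talkers(
    flows: Iterable[Flow],
    dimensions: Sequence[str] = ("src_ip", "dst_ip", "dst_port"),
    limit: int = 10,
) -> List[Tuple[tuple, int]]:
    """Sum bytes per unique combination of dimension values, largest first."""
    totals: Counter = Counter()
    for flow in flows:
        key = tuple(flow[d] for d in dimensions)
        totals[key] += int(flow["bytes"])
    return totals.most_common(limit)


# Hypothetical flow records using documentation-range addresses.
sample_flows = [
    {"src_ip": "198.51.100.7", "dst_ip": "203.0.113.10", "dst_port": 443, "bytes": 5000},
    {"src_ip": "198.51.100.7", "dst_ip": "203.0.113.10", "dst_port": 443, "bytes": 7000},
    {"src_ip": "198.51.100.9", "dst_ip": "203.0.113.10", "dst_port": 80, "bytes": 3000},
]
ranked = top_talkers(sample_flows)
```

Adding or removing entries in `dimensions` mirrors adding or removing dimensions in the Group By Dimensions field: fewer dimensions give coarser totals, more dimensions give finer ones.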
So far we’ve talked about exploring capacity on an ad-hoc basis using the Capacity Planning page, and also about getting alert notifications when a capacity threshold has been triggered. Another aspect of planning would be to get regular reports on capacity utilization across our infrastructure. We can do this using the reporting feature of Kentik Detect dashboards.
You will recall that when we set up alerting on capacity, we associated the alert with a Capacity Management dashboard. To access this dashboard, click on Dashboards on the portal navbar. On the Dashboards page, set the Show selector to Presets, then start typing “capacity” into the Filter field. When you see the tile titled Capacity Management, click on the title to open the dashboard.
The panels in this dashboard are configured to show interfaces categorized by direction (source or destination) and capacity (1G, 10G, 20G, 30G, 40G, and 100G). The default time range is the past day (specified in the Time pane of the sidebar, which is collapsed by default). Scrolling through the dashboard, you can see which interfaces in each category are closest to capacity.
The collection of panels in the dashboard can be exported as a PDF by clicking the Export icon at the upper right of the main dashboard display area. You can also create a subscription (Admin » Subscriptions) that will automatically send a report (PDF) based on the dashboard to a specified list of recipient emails.