Last updated 01 November 2016
Application-level metrics help developers investigate and diagnose issues with their applications running on Heroku.
Application metrics are only available to apps that are using hobby, professional (performance), or private dynos. Applications using free dynos do not have access to application metrics. Also, dynos on legacy pricing plans may not have access to some newer metrics features, such as high-resolution metrics. Not all Application Metrics features are available to all dyno types. Distinctions in feature availability are noted where applicable.
This document describes Application Metrics production features. Additional experimental features are covered on the Application Metrics beta channel page.
To view application metrics, navigate to your app in the Heroku Dashboard and click the Metrics tab. Plot views can be toggled between the default horizontal stacked layout and a compact multi-axes stacked layout. Hovering over a time point on a plot will provide you with the measurement(s) for that time interval.
Individual metrics are gathered per process type.
There are 3 levels of data resolution in Application Metrics:
- 1 minute (2 hour view)
- 10 minute (24, 24-48 and 48-72 hour views)
- 1 hour (72 hour combined view)
Values represent rollups of data points for the sampling window. Hobby dynos only have access to 24 hours of data at 10 minute resolution. Resource utilization values represent maxima or averages per process type.
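As a rough illustration of what a rollup means here, the sketch below groups 1 minute samples into 10 minute windows and reports the maximum and the mean of each window. The function name `rollup` and the sample readings are hypothetical, and the exact aggregation Heroku performs is not specified in this article, so treat this as one plausible reading rather than the actual implementation.

```python
from statistics import mean

def rollup(samples, window=10):
    """Group consecutive 1 minute samples into fixed-size windows and
    summarize each window by its maximum and its mean.
    (Hypothetical helper, not part of any Heroku API.)"""
    windows = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [{"max": max(w), "mean": mean(w)} for w in windows]

# Made-up 1 minute memory readings (MB) for one process type, covering 20 minutes.
one_minute_rss = [210, 212, 215, 260, 240, 230, 228, 225, 224, 226,
                  227, 229, 231, 300, 280, 250, 240, 235, 233, 232]

# Two 10 minute data points, each carrying a max and a mean.
ten_minute = rollup(one_minute_rss, window=10)
for point in ten_minute:
    print(point)
```

A short spike (such as the 300 MB reading) survives in the window maximum but is smoothed out in the window mean, which is why both kinds of values are useful.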
Metrics gathered for web dynos only
The following metrics are gathered for only the web process type:
- Median: The median response time (50th percentile) of HTTP requests within the specified sampling interval (10 or 1 minute). This means that 50% of an application’s web requests were completed within less time than the median, and 50% were completed within more.
- 95th Percentile: The 95th percentile response time of HTTP requests within the specified sampling interval. This means that 95% of an application’s web requests were completed within less time, and 5% were completed within more. This is helpful for providing an upper bound (but not maximum) for expected response times.
- OK: The number of successful (status codes < 500) requests serviced per minute. For a 10 minute rollup, this is the total number of successful requests divided by 10 to provide per-minute values.
- Failed: The number of failed (status codes >= 500) requests serviced per minute. For a 10 minute rollup, this is the total number of failed requests divided by 10 to provide per-minute values.
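The definitions above can be sketched in a few lines. This example assumes a hypothetical list of `(status_code, response_time_ms)` pairs collected during one 10 minute window, and it uses the nearest-rank method for the 95th percentile; the article does not say which percentile method Heroku uses, so that choice is an assumption.

```python
import math
from statistics import median

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(requests, window_minutes=10):
    """Summarize one sampling window.
    requests: list of (status_code, response_time_ms) tuples (hypothetical shape)."""
    times = [ms for _, ms in requests]
    ok = sum(1 for status, _ in requests if status < 500)
    failed = len(requests) - ok
    return {
        "median_ms": median(times),
        "p95_ms": percentile(times, 95),
        # Totals are divided by the window length to give per-minute values.
        "ok_per_min": ok / window_minutes,
        "failed_per_min": failed / window_minutes,
    }

# Made-up window: 20 requests taking 10, 20, ..., 200 ms.
statuses = [200] * 18 + [404, 503]
requests = list(zip(statuses, range(10, 210, 10)))
stats = summarize(requests)
print(stats)
```

Note that the 404 response counts as OK under these definitions, because only status codes of 500 and above are treated as failed.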
Metrics gathered for all dynos
The following metrics are gathered for all process types, and are averages of the metrics of the dynos of that process type for a given application:
Maximum overall memory usage is displayed as a single stacked plot, combining maximum rss and maximum swap memory as reported for 10 or 1 minute increments. Mean total memory (rss + swap) is displayed as a blue line. Memory quota is depicted as a dashed gray line with any quota breaches flagged in red. The latest, mean, and max percent memory are shown for the selected time interval (e.g. 24 hours), along with the raw value.
- RSS: The amount of memory (megabytes) held in RAM across dynos of a given process type. Max RSS is reported for each 10 or 1 minute interval.
- Swap: The portion of a dyno’s memory, in megabytes, stored on disk. It’s normal for an app to use a few megabytes of swap per dyno. Higher levels of swap usage though may indicate too much memory usage when compared to the dyno size. This can lead to slow response times and should be avoided. Max swap is reported for each 10 or 1 minute interval.
- Total Memory: Mean total memory represents the portion of memory which users can optimize and is shown as the sum of rss and swap as measured in 10 or 1 minute increments and averaged across all dynos.
- Memory Quota: The maximum amount of RAM available to your selected dyno type, above which an R14 memory error would be triggered.
- 1m Load Average: For 10 minute sampling intervals the mean of the 1 minute load average for each 10 minute period is shown. For 1 minute windows the 1 minute load average is directly displayed. The load average reflects the number of CPU tasks that are in the ready queue (i.e. waiting to be processed) expressed as an exponentially dampened average over the past 30 minutes.
- 1m Load Max: For 10 minute windows this is the maximum value of the 1 minute load average for the time period. For 1 minute intervals the maximum load average from 20 second sampling intervals is displayed.
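To make the load metrics more concrete, the sketch below applies the classic Unix-style exponentially damped update to a series of runnable-task counts sampled every 20 seconds, then derives the mean and the maximum of the resulting 1 minute load average. The update formula, the function names, and the task counts are assumptions for illustration; the article does not give the exact formula Heroku uses.

```python
import math

SAMPLE_INTERVAL_S = 20  # sampling interval mentioned in the text

def update_load_avg(previous, runnable_tasks, period_s=60,
                    interval_s=SAMPLE_INTERVAL_S):
    """One step of an exponentially damped moving average
    (assumed Unix-style formula, not confirmed by the article)."""
    decay = math.exp(-interval_s / period_s)
    return previous * decay + runnable_tasks * (1 - decay)

def load_series(task_counts, period_s=60):
    """Feed successive runnable-task counts through the damped average."""
    avg, series = 0.0, []
    for count in task_counts:
        avg = update_load_avg(avg, count, period_s)
        series.append(avg)
    return series

def summarize_window(series):
    """Mean and maximum of the 1 minute load average over one window."""
    return {"load_avg": sum(series) / len(series), "load_max": max(series)}

# Made-up input: 2 runnable tasks at every sample for 10 minutes (30 samples).
series = load_series([2] * 30)
summary = summarize_window(series)
print(summary)
```

With a constant queue of 2 tasks the damped average climbs toward 2 without overshooting, so the window mean sits below the window maximum until the series settles.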
The Events table contains Heroku errors and user-initiated events that influence application health. Currently tracked user activities include deployments, configuration changes, and restarts. Activity events (such as deployments and configuration changes) are displayed in blue. Because at most one marker of each event type is displayed per time interval, color gradients indicate the relative number of events of that type that occurred during the interval. Additional details are available by hovering over the specific event. These details include error descriptions and, for user-initiated events, who made the change and what happened.
Critical errors are displayed in red, warning level errors in orange, and informational errors in gray.
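One way to picture the gradient behavior is to count events per time interval and event type, then scale each count against the busiest interval for that type. The sketch below does this with hypothetical `(unix_timestamp, event_type)` pairs; the function names and the per-type normalization are assumptions, since the article does not describe how the Dashboard computes the gradient.

```python
from collections import Counter

def bucket_events(events, interval_s):
    """Count events per (interval index, event type).
    events: iterable of (unix_timestamp, event_type) pairs (hypothetical shape)."""
    return Counter((ts // interval_s, kind) for ts, kind in events)

def marker_intensity(counts):
    """Scale each bucket's count to 0..1 relative to the busiest
    interval of the same event type (assumed normalization)."""
    peaks = {}
    for (_, kind), n in counts.items():
        peaks[kind] = max(peaks.get(kind, 0), n)
    return {key: n / peaks[key[1]] for key, n in counts.items()}

# Made-up events, bucketed into 10 minute (600 second) intervals.
events = [(0, "deploy"), (120, "deploy"), (540, "deploy"),
          (610, "deploy"), (30, "config"), (700, "R14")]
counts = bucket_events(events, interval_s=600)
intensity = marker_intensity(counts)
print(intensity)
```

Each `(interval, type)` key corresponds to a single marker, and its intensity stands in for how dark that marker would be drawn.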
Configuration Variable Change Events
Changes to configuration variables are also captured as events, with the variable that changed shown in the event details.
Deployment Events and Markers
Deploys are also displayed in the Events chart. Deployment activities are extended onto the Metrics plots as deployment markers to help users visualize the impact of deployments on application health.
Scaling events represent horizontal and/or vertical dyno scaling activities.
There are three categories of restart events displayed on the Events chart:
- user initiated, including manual restarts and restarts associated with deployments, configuration changes, and dyno type changes
- platform initiated, daily scheduled dyno restart (shown as daily restarts in the logs)
- platform initiated restarts after unexpected runtime crashes (shown as relocations in the logs)
In addition to raw metrics, Heroku provides online notifications of specific conditions that might be indicative of problems with your application. Links to relevant Dev Center articles are included to provide recommendations on how to correct the problem. Language-specific guidance is provided where available. The list of alerts provided is constantly evolving as we gather more data about application behavior, but examples include alerts on memory errors, request timeouts, and slow response times.
Metrics for App Favorites
24 hour summary web metrics and sparklines are also displayed for each favorited app in the default Heroku Dashboard view. Summary metrics include the total number of dyno and router errors, and the most recent 95th percentile response time and throughput value based on 10 min resolution. Only apps with web dynos will be displayed. Other basic information about your app, including dyno formation/location, most recent deployment, and language, is also shown.
The Threshold Alerting feature is available to apps running on Professional dynos. It allows you to specify limits on web dyno 95th percentile response time and the percentage of failed requests above which an alert will be triggered. Email, PagerDuty, and dashboard notifications are supported.
To set up an alert, select “Configure Alerts” to open the Alert Setup dialog.
PagerDuty and Additional Email Configuration (optional)
By default, email notifications are sent to all app owners and collaborators for non-org accounts, and to admins for apps in a Heroku Enterprise org. For email-based PagerDuty integration, first create a new PagerDuty service (or use an existing one) with your preferred escalation policies, following PagerDuty’s instructions. Enter the email address you’ll be using; this is the address that PagerDuty will use to create incidents and that Heroku will use for email notifications. Set up the following PagerDuty rules to parse your alerting email notifications:
The following steps are the same for both PagerDuty integration and additional email setup. Select “Add Email for Alert Notifications” and enter the PagerDuty or additional email. A code will be sent to PagerDuty Incidents or additional email, respectively, for confirmation. Enter this code in the Alerting Setup to continue. One additional email is supported per app.
Select the metric(s) that you wish to monitor, “Response Time” and/or “Failed Responses”. Adjust the threshold and sensitivity as appropriate. Note that for response time the minimum threshold is 50ms. The sensitivity is the duration an error state must occur prior to triggering an alert. The Alert Simulation shows you how many alerts would have been triggered in the past 24 hours for the app with the selected settings.
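The interaction between threshold and sensitivity can be sketched as a small simulation. The example below assumes one 95th percentile response time sample per minute and treats sensitivity as the number of consecutive minutes the value must stay above the threshold before an alert fires. The function name, the sampling granularity, and the sample values are hypothetical; the Alert Simulation in the Dashboard may work differently.

```python
MIN_RESPONSE_TIME_THRESHOLD_MS = 50  # minimum threshold stated in the text

def simulate_response_time_alerts(values_ms, threshold_ms, sensitivity_minutes):
    """Count how many alerts would have fired for a series of per-minute
    response times. An alert fires once when the value has exceeded the
    threshold for `sensitivity_minutes` consecutive samples, and it can
    fire again only after the value has dropped back to or below the
    threshold. (Assumed semantics, for illustration only.)"""
    if threshold_ms < MIN_RESPONSE_TIME_THRESHOLD_MS:
        raise ValueError("response time threshold must be at least 50 ms")
    alerts = 0
    streak = 0
    for value in values_ms:
        if value > threshold_ms:
            streak += 1
            if streak == sensitivity_minutes:
                alerts += 1
        else:
            streak = 0
    return alerts

# Made-up per-minute 95th percentile response times (ms).
p95_ms = [400, 1200, 1300, 500, 1500, 1600, 1700, 1800, 300, 1100, 1200, 1300]

for sensitivity in (1, 3, 5):
    fired = simulate_response_time_alerts(p95_ms, 1000, sensitivity)
    print(f"sensitivity={sensitivity} min -> {fired} alert(s)")
```

Raising the sensitivity filters out short breaches: the same series produces fewer alerts as the required duration grows.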
A summary of alert configuration and state will appear below the corresponding metrics plot, with the option to edit the existing setup.
Specify whether you would like to receive email notifications (default or PagerDuty/additional), and if so, the notification frequency for active alerts. Leaving both boxes unchecked results in silent (dashboard only) notifications.
Lastly, activate the alert and select “Confirm”.
Dashboard alert notifications appear in multiple locations in the Heroku dashboard, including:
- the Metrics Events table
- the corresponding Metrics plot
- the apps list
- app headers
The icon for an actively alerting app is denoted with a red diamond symbol.
With email notifications, an initial email is sent once an alert is triggered. For active alerts, emails are sent at the frequency specified in your delivery preferences. A final alert notification is sent when the error state is resolved.
For those using Heroku Enterprise orgs the operate permission is required to configure and view alerts in Application Metrics.