Reviewing Your Key Application Performance Metrics
Last updated May 27, 2020
Table of Contents
- Introduction
- Pre-requisite monitoring setup
- Start a review document
- Record your resources and configuration
- Use the Production Check feature
- Record your errors and events
- Record your response times and throughput
- Identify slow transactions
- Record your dyno load and memory usage
- Run pg:diagnose
- Record your Postgres load
- Record your number of connections
- Record your cache hit rates and IOPs
- Record your database size
- Identify your expensive queries
- Record language-specific or other key metrics
- (Optional) Compare to previous review
- Next steps
Introduction
In this tutorial, you will:
- Review and identify patterns in your baseline metrics
- Identify items to investigate for remediation or optimization
These checks will reveal potential performance bottlenecks. The Next Steps section includes guidance for resolving the issues you find.
Pre-requisite monitoring setup
While the Metrics tab of your Heroku Dashboard and data.heroku.com provide a picture of the overall health of your application and database, additional tools are required to get a more complete metrics review.
Install these tools at least 7 days before using this tutorial so you have enough data to identify issues.
- An app deployed to Heroku on non-free dynos that is receiving traffic. Application metrics are not available for free plans.
- A non-hobby-tier Heroku Postgres database. Postgres metrics are not available for hobby plans.
- A logging add-on, to view app and database logs from the test
- An application performance monitoring (APM) add-on, such as New Relic, Scout, or AppOptics, to identify slow endpoints and services
- An infrastructure monitoring add-on, such as Librato or AppSignal, to measure load on the app and database
- log-runtime-metrics enabled
- (Optional) language runtime metrics enabled
This tutorial includes screenshots from a variety of monitoring tools. Use the tools that work best for your app.
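If any of these prerequisites are missing, most can be provisioned from the Heroku CLI. The following is a minimal sketch, not a prescription: the add-on slugs are examples only, the app name example-app is a placeholder, and the labs flag for language runtime metrics varies by language, so check the relevant Dev Center pages before running it.

```bash
# Provision example monitoring add-ons (slugs/plans are placeholders; pick the tools you prefer)
heroku addons:create papertrail -a example-app   # logging
heroku addons:create scout -a example-app        # APM
heroku addons:create librato -a example-app      # infrastructure metrics

# Enable per-dyno load and memory logging, then restart so it takes effect
heroku labs:enable log-runtime-metrics -a example-app
heroku restart -a example-app
```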
Start a review document
Start a document to capture your observations.
Set your Heroku Dashboard Metrics tab and all monitoring tools to the same timezone and units of measure (e.g. rpm vs. rps) for easier reference.
Set your monitoring tools to look at the same time period, e.g. the last 7 days of history. Confirm that your selected time period is typical for your app. Note your selected time period in your review document.
At the bottom of your review document, add a section called “Items to Investigate.” You will add items to this section as you record observations about your metrics. Later, you will dive deeper into the items you have flagged for investigation.
Record your resources and configuration
In this step, you will record relevant configuration info for your resources. This allows you to interpret your metrics in the context of your app’s current configuration.
Record the following info, adjusting for the specifics of your app:
Resource | Version/Plan (examples shown) | Config (examples shown) |
---|---|---|
Stack | heroku-18 | |
Region | U.S. Common Runtime | |
Web Dynos | Performance-M | 1-4 dynos (autoscaling enabled - p95 = 800ms threshold) |
Web Server | Puma 4.3 | WEB_CONCURRENCY = 2 |
Framework | Rails 5.2.3 | RAILS_MAX_THREADS = 5, pool (from database.yml): 5 |
Database | Postgres 11.6: Standard-0 | Attached as DATABASE_URL, HEROKU_POSTGRESQL_SILVER_URL |
Other | Other resources and add-ons, e.g. monitoring tools, worker dynos, Heroku Scheduler | |
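Much of this information can be pulled from the Heroku CLI rather than hunting through the Dashboard. A minimal sketch, where the app name example-app is a placeholder:

```bash
heroku stack -a example-app                         # current stack (e.g. heroku-18)
heroku ps -a example-app                            # dyno formation, types, and quantities
heroku pg:info -a example-app                       # Postgres version, plan, and attachments
heroku config:get WEB_CONCURRENCY -a example-app    # web server concurrency setting
heroku config:get RAILS_MAX_THREADS -a example-app  # framework thread setting (Rails example)
heroku addons -a example-app                        # other attached resources and add-ons
```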
Use the Production Check feature
Use the Production Check feature on your Heroku Dashboard. Take a screenshot of it and include it in your review document. Add any warnings to your list of “Items to Investigate.”
Record your errors and events
In this step, you will gather the error and event info concerning your app. These include events such as app deploys, dyno formation changes, etc. This provides additional context as you interpret your metrics.
In your Heroku Dashboard, go to your Metrics tab and scroll to the Events section. Take a screenshot and add it to your review document.
If your monitoring tool includes an error analytics feature, also record that info in your review document.
Write a description for your screenshot(s). Take note of the following:
- When you deployed and a description of what was deployed, e.g. a link to the merged pull request.
- Changes to your dyno formation so that you know how many dynos you were running throughout your observation timeframe
- The pattern of your daily dyno restarts. If any daily restarts occur outside of low traffic periods, add an item to your “Items to Investigate” section at the end of your review document.
- Frequency and type of errors. Add these errors to your “Items to Investigate” list. (A log-filtering sketch for surfacing these errors follows this list.)
- Any incidents that occurred, along with links from status.heroku.com that detail the incidents. These incidents may account for anomalies you encounter as you review your metrics.
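If you prefer the CLI, recent deploys and router errors can be cross-checked there as well. A rough sketch, assuming the placeholder app name example-app; note that the heroku logs buffer only holds the most recent ~1,500 lines, so use your logging add-on's archives for a full 7-day window.

```bash
# Recent releases: deploys, config changes, and dyno formation changes
heroku releases -a example-app

# Scan the most recent router lines for errors such as H12 (Request Timeout)
heroku logs -n 1500 -a example-app | grep "heroku\[router\]" | grep "at=error"
```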
Record your response times and throughput
From your Heroku Dashboard Metrics tab or your monitoring tool, take a screenshot of your throughput and response time and add them to your review document.
On the Heroku Dashboard Metrics tab, you can select and de-select what is shown by clicking on the legend to the right of the graph. For example, just the p50 response times for the example application are shown below:
Response times for your web application should be under 500ms.
The following signs point to potential concurrency issues:
- Response times increase when throughput increases
- Response times increase for a sustained period after a deploy or scaling event
If you notice these patterns in the metrics graphs, use your monitoring tool to examine request queue time. High queue time indicates that your app is unable to handle the volume of requests, causing those requests to back up in the router; the usual causes are slow code, insufficient web concurrency, or too few running processes.
An example graph from Scout APM, with queue time shown in pink, is below:
The above screenshot shows elevated queue times. Sometimes the higher queue times are accompanied by higher throughput, indicating insufficient concurrency or dyno count for the amount of traffic.
Make some observations about any patterns you see. Record the following in your review document. Example info is shown for reference:
Metric | Suggested Baseline | Your Baseline | Comments |
---|---|---|---|
Avg p50 response time | < 500 ms | 135.4 ms | Response times are higher from Friday through midday Monday. Fewer dynos were run during this time than other times during the week. |
Avg p95 response time | 500 ms | 418.9 ms | |
Avg p99 response time | 1000 ms | 1087 ms | There are some large spikes in p99 response times. These appear to match up to timestamps for H12 Request Timeout errors. |
Avg throughput | — | 1422 rpm | |
Max throughput | — | 2248 rpm | Queue time jumps up whenever throughput is above 2000 rpm, though higher queue times do not always match higher throughput. |
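If you want to spot-check the dashboard numbers against raw router logs, the service= field on each router line reports per-request service time in milliseconds. A rough sketch (example-app is a placeholder; for a full week of history, run this against archived logs from your logging add-on instead):

```bash
# Pull recent router lines and list the 20 largest per-request service times
heroku logs -n 1500 -a example-app \
  | grep "heroku\[router\]" \
  | grep -o "service=[0-9]*" \
  | sort -t= -k2 -n \
  | tail -20
```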
In the next couple of steps, you will identify some slow transactions and look at your web dyno utilization and concurrency.
Identify slow transactions
A monitoring tool with transaction tracing capabilities is invaluable for identifying slow transactions.
If your p95/p99 response times are slow, or you are experiencing H12 errors, you should examine your transactions that have the slowest average response times.
Please consult your monitoring tool’s documentation for steps on identifying these transactions. Here are the transactions listed as the slowest average response time for the example application, as shown in New Relic:
Add these transactions to the “Items to Investigate” list in your review document.
Record your dyno load and memory usage
In this step, you will take screenshots and make observations about your dyno load and memory.
Determining the correct number and type of dynos and the most effective web concurrency is a non-trivial task. Start by looking at your web dyno load and memory to check whether your dynos are currently over- or underutilized.
In your Heroku Dashboard Metrics tab or your monitoring tool, take screenshots of these two metrics: dyno load and memory usage.
The best practices guidance for max dyno load depends on what dyno type you use. Please consult this Knowledge Base article to find the recommended load for your dyno type. If you experience high load issues, you may want to investigate if it is possible to move CPU-heavy tasks into background jobs instead.
The recommendation for memory usage is to keep max memory usage below 85% of your memory quota. You can see memory limits for your dyno type here.
As you review the app’s memory behavior, red flags include any swap usage and R14 or R15 errors. Total memory usage includes RAM and swap memory. It is reasonable to observe some swap (below 50 MB on the Common Runtime), but if swap is significant, you may need to lower your web concurrency settings and add more dynos, or increase the dyno size. Note that Private Spaces dynos do not use swap; the dyno is restarted instead.
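With log-runtime-metrics enabled, the same load and memory samples also appear in your logs, which is handy for correlating spikes with R14/R15 events. A minimal sketch, with example-app as a placeholder:

```bash
# Per-dyno load and memory samples emitted by log-runtime-metrics
heroku logs --source heroku -n 1500 -a example-app | grep "sample#load_avg_1m"
heroku logs --source heroku -n 1500 -a example-app | grep "sample#memory_total"

# R14 (Memory quota exceeded) and R15 (Memory quota vastly exceeded) events
heroku logs -n 1500 -a example-app | grep -E "Error R14|Error R15"
```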
Record your observations, along with comments on any patterns you see. A table with example info is shown below (Performance-M dynos are used in the example):
Metric | Suggested Baseline | Your Baseline | Comments |
---|---|---|---|
Avg dyno load | < 3.0 | 3.17 | Load is frequently above the recommended baseline. Slightly lower load experienced over the weekend through midday Monday. |
Max dyno load | 3.0 | 7.24 | Max load is more than double the recommended baseline. |
Avg memory usage | < 85% | 27.8% | Average and max memory usage are virtually the same except for a few spikes in max memory. Memory is generally underutilized. |
Max memory usage | 85% | 195.1% | Memory spikes match the timestamps for R14 errors. |
Max memory swap | < 50 MB | 2443 MB | |
Web concurrency settings | | WEB_CONCURRENCY = 2, RAILS_MAX_THREADS = 5 | |
If your load or memory appears to be under- or over-utilized, add an item to your “Items to Investigate” section to look into optimizing dyno usage and web concurrency.
High memory usage can also be indicative of other memory issues such as memory leakage or bloat. You may want to add diagnosing memory issues to your “Items to Investigate.”
Different languages, web frameworks, and applications behave quite differently, making it difficult to offer specific advice. When you review your “Items to Investigate” in the last step of this tutorial, links are provided to help you with adjusting web concurrency and troubleshooting memory issues for a variety of languages and frameworks.
Run pg:diagnose
In this step, you will start looking at your database performance.
Your database is an important resource to monitor. pg:diagnose performs a number of useful health and diagnostic checks that help analyze and optimize database performance.
Run pg:diagnose and screenshot the output for your review document. Add anything that is flagged in red or yellow to your “Items to Investigate” list.
For example, from the output above, you would add the indexes listed in yellow as items to investigate further. Your investigation would include:
- confirming that the Never Used Indexes are also not used on any followers you may have
- removing the Low Scans, High Writes indexes in your staging environment to determine the impact before applying changes to your production database
For more ideas of what to investigate, take a look at the pg:diagnose section of this article and review anything related to the red and yellow warnings in your output.
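For reference, a minimal sketch of running the check from the CLI (example-app is a placeholder; you can also target a specific attachment such as the HEROKU_POSTGRESQL_SILVER_URL attachment recorded earlier):

```bash
# Run the built-in health and diagnostic checks against the primary database
heroku pg:diagnose -a example-app

# Or target a specific attachment, e.g. a follower
heroku pg:diagnose HEROKU_POSTGRESQL_SILVER_URL -a example-app
```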
Record your Postgres load
In this step, you will record your Postgres load and make some observations.
A variety of metrics are made available in the logs. Although there are no graphs for these Postgres metrics in the Heroku Dashboard, a small selection of these metrics are available as graphs at data.heroku.com. Some monitoring tools do include additional graphs for this info. If your monitoring tool does not offer you a way to monitor Postgres, you may look into downloading logs from your logging add-on provider and using a tool like pgBadger to generate reports from them.
Take a screenshot of your database load. The following is an example from Librato:
A load average of 1.0 indicates that, on average, processes were requesting CPU resources for 100% of the timespan. This number includes I/O wait.
If your load is higher than 1, add an item to your “Items to Investigate” list to see whether workloads can be reduced, ensure you are managing your sessions correctly, and look into connection pooling to reduce the overhead associated with connections. You can check the number of vCPUs available for your plan here.
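If your monitoring tool does not graph Postgres load, the raw samples in your logs can be filtered directly. A rough sketch, assuming your plan emits the sample#load-avg-1m metric key to the log stream and using the placeholder app name example-app:

```bash
# Heroku Postgres emits load averages, IOPS, and connection counts to the app's log stream
heroku logs -n 1500 -a example-app | grep "sample#load-avg-1m"
```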
Record your number of connections
Take screenshots of the number of your active and waiting connections. The following screenshots are from Librato:
Although max connection limits are listed in this table, the actual maximum number of connections that can be made to your database server depends on other factors. Each connection adds overhead, and if your database server is already experiencing high load, it may not be able to reach the hard connection limit.
A way to keep connection overhead down is to use connection pooling, which will reuse connections. If the number of your connections is approaching your connection limit or your load looks concerning, add an item to look into connection pooling to your “Items to Investigate” list.
Waiting connections are those waiting on database locks to proceed. Some lock waits are to be expected, but high numbers indicate an issue. Since Heroku metrics are only collected once every minute, consistent numbers seen for this metric can indicate a problem. If the number of waiting connections is greater than 0 for more than 5 minutes, add an item for examining database locking contention to your “Items to Investigate”.
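If your monitoring tool does not chart these, the current counts can be checked ad hoc. A hedged sketch, assuming Postgres 9.6+ (so pg_stat_activity has a wait_event_type column) and the placeholder app name example-app:

```bash
# Total connections currently open against the database
heroku pg:psql -a example-app -c "SELECT count(*) FROM pg_stat_activity;"

# Connections currently waiting on a lock
heroku pg:psql -a example-app -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';"

# Details of current locks (requires: heroku plugins:install heroku-pg-extras)
heroku pg:locks -a example-app
```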
Record your cache hit rates and IOPs
Take screenshots of your cache hit rates and IOPs from your monitoring tools. The following screenshots are from AppSignal.
Your cache hit rates should be above 0.99 or 99%. A value of 100% would indicate perfect cache utilization, with anything less than that indicating cache misses. However, cache misses do not necessarily mean that disk I/O was needed as Postgres also relies heavily on the file system’s memory cache.
Heroku Postgres instances allocate up to 8 GB to the shared_buffers cache, using the formula minimum_of(0.25 * Available RAM, 8GB). For example, a plan with 16 GB of RAM gets min(4 GB, 8 GB) = 4 GB of shared_buffers. As such, the read-iops metric is a better indicator of whether or not the amount of RAM provided by your current Postgres plan is sufficient to cache all of your regularly accessed data.
The read-iops and write-iops metrics track how many read and write I/O requests are made to the main database disk partition, measured in IOPS (I/O Operations Per Second). Each Heroku Postgres plan has a max Provisioned IOPS (PIOPS) configuration. PIOPS is the total number of reads + writes per second that can be sustained by the provisioned disk volume. Ideally, you want your database reads to come from memory (cache) rather than disk, which is much slower.
If you have poor cache hit rates and consistently high read-iops, it is time to upgrade to a larger Heroku Postgres plan to increase your cache size.
If you have high IOPS, you can tune your queries and indexes to reduce the amount of data being read by queries. If they are already optimized, and you are consistently going past your PIOPS limits, it may be time to upgrade to a plan with a higher PIOPS limit.
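If your monitoring tool does not report cache hit rates, they can be computed from Postgres's own statistics views. A minimal sketch using the standard pg_statio_user_tables and pg_statio_user_indexes views, with example-app as a placeholder:

```bash
# Table (heap) cache hit ratio -- should be above 0.99
heroku pg:psql -a example-app -c "
  SELECT sum(heap_blks_hit) /
         nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS table_hit_ratio
  FROM pg_statio_user_tables;"

# Index cache hit ratio
heroku pg:psql -a example-app -c "
  SELECT sum(idx_blks_hit) /
         nullif(sum(idx_blks_hit) + sum(idx_blks_read), 0) AS index_hit_ratio
  FROM pg_statio_user_indexes;"
```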
Record your database size
Your database size includes all table and index data on disk, including database bloat. You will want to monitor this to ensure you are not approaching plan limits. Take a screenshot of your database size. An example from AppSignal is included below:
If performance starts suffering as the database grows, check for bloat periodically using heroku pg:bloat. Anything with a bloat factor larger than 10 is worth investigating; however, larger tables may have a low bloat factor but still take up a lot of space in bloat and require their vacuum thresholds to be adjusted from the defaults. Few monitoring tools offer graphs for bloat, but monitoring database size may at least remind you to check for bloat when necessary. If you have a large database, add checking pg:bloat to your list of “Items to Investigate”.
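To spot-check size and bloat from the CLI, a minimal sketch (example-app is a placeholder):

```bash
# Current data size, table count, and plan details
heroku pg:info -a example-app

# Estimated bloat per table and index (requires: heroku plugins:install heroku-pg-extras)
heroku pg:bloat -a example-app
```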
Identify your expensive queries
In this step, you will identify queries for optimization.
You can view expensive queries for your database at data.heroku.com. Select a database from the list and navigate to its Diagnose tab. List these queries in your “Items to Investigate” section. You will need to look through your logs to find the exact parameters passed to these queries when you are investigating.
The pg:outliers command from the Heroku pg-extras plugin is also helpful for finding slow queries. Run that command to find queries that have a high proportion of execution time and record these in your “Items to Investigate” list.
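A minimal sketch of running it (pg:outliers relies on the pg_stat_statements extension, which is enabled by default on recent Heroku Postgres plans; example-app is a placeholder):

```bash
# Install the pg-extras plugin once per machine
heroku plugins:install heroku-pg-extras

# Queries consuming the largest share of total execution time
heroku pg:outliers -a example-app
```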
Record language-specific or other key metrics
If you have language runtime metrics enabled or other metrics to monitor, such as Heroku Redis, also take screenshots of those and add them to your document. This tutorial does not detail the key metrics for other resources. Please refer to each resource’s documentation page in Dev Center or elsewhere for guidance on key things to monitor.
(Optional) Compare to previous review
Note: if this is your first review, you will not have a previous review to compare with.
If you have previously completed a metrics review, you may want to compare your observations, make notes and come up with more action items.
For example, you may notice that although your database size is well within the limits of your plan, it has grown substantially since the review done the previous quarter. Your action items may include investigating if this growth is expected to continue so that you can better plan for future scalability.
Next steps
Interpreting metrics is more of an art than a science. When a metric appears high, consider your metrics in concert rather than in isolation. As you completed each of the previous steps, you should have been filling out your “Items to Investigate” section. Take some time to summarize these items and add details about their next steps.
While your next steps will vary depending on what is required for your app, the following links may help you as you start your deeper investigation.
- Determining the correct number and type of dynos: see Optimizing Dyno Usage
- Concurrency: Understanding Concurrency plus specific articles on concurrency for Puma, Node, Gunicorn and PHP
- Memory Issues: Basic methodology for optimizing memory, plus specific articles for Ruby, Node, Java, PHP, Go and Tuning glibc Memory Behavior
- Slow response times:
  - Request Timeout
  - Check your monitoring provider’s documentation on using transaction tracing in your tool. You can look at transaction traces in your monitoring tool to see if the slowness is caused by rendering, a call to an external API, a slow database query, etc.
  - You may also want to rewrite your computationally intensive tasks as background jobs in order to keep request times short
- Slow queries: Expensive Queries, Efficient Use of PostgreSQL Indexes
- Database connection pooling: Client-side Postgres Connection Pooling, Concurrency and Database Connections in Ruby with ActiveRecord and Concurrency and Database Connections in Django.
- Database locking: pg:locks
- Database bloat: Managing VACUUM on Postgres
Heroku Enterprise customers can contact the Customer Solutions Architecture (CSA) team for help with interpreting their metrics or determining next steps. Many Heroku Enterprise customers are also eligible for deeper engagements with the CSA team, such as App Assessments, which include a review of your metrics and a full report of recommendations. An example App Assessment can be found here. To determine eligibility or to request assistance, please use the “Ask a CSA” link in our Enterprise Portal.