Optimizing Dyno Usage
Last updated May 24, 2023
A fundamental aspect of optimizing any application is to ensure it is architected appropriately. For example, it should use background jobs for computationally intensive tasks in order to keep request times short, and use a process model to ensure that separate parts of the application can be scaled independently.
Beyond this, you may reach a point where you need to scale or optimize by making more efficient use of available resources. For example, if your web requests are short and handled efficiently, you may be able to increase throughput on a dyno by letting the web server handle more requests concurrently, usually at the expense of using more RAM.
This article provides a bird’s-eye view of how to go about optimizing an application for the various dyno types. It provides some rough estimates of capabilities, and pays particular attention to memory usage and concurrency. The techniques suggested in this article are relevant to any environment that runs your application, not just a dyno. For specific guidance on minimizing your costs for different environments, see Optimizing Resource Costs.
Heroku Enterprise customers with Premier or Signature Success Plans can request in-depth guidance on this topic from the Customer Solutions Architecture (CSA) team. Learn more about Expert Coaching Sessions here or contact your Salesforce account executive.
Considering different dyno types
Heroku offers a range of dyno types. Each type has a different CPU and RAM profile.
Changing the dyno type of an application increases complexity: as a developer you have introduced a new variable, the type of the dyno, in addition to the number of dynos.
However, a well designed app will quite naturally be able to make use of different dyno types, and thinking about optimizing your application to make better use of a dyno is a worthwhile endeavor.
Even if your application doesn’t need to make use of different dyno types, consider applying these optimization techniques to your current dyno type anyway.
The different dyno types offer three important axes of optimization: CPU, RAM and the performance profile.
Most applications are not CPU-bound on the web server.
If you are processing individual requests slowly due to CPU or other shared resource constraints (such as the database), then optimizing concurrency on the dyno may not help your application’s throughput at all. Put another way, if your application is slow when there is little traffic, the techniques in this article may not increase performance.
The different dyno types do offer different CPU performance characteristics, and will help a little in high-CPU situations, but ideally you should consider offloading work to a background worker as a first step in optimization, as well as optimizing the code.
A final aspect of CPU is the number of cores. Some dyno types, performance dynos in particular, offer multiple cores. With multiple cores, you may be able to execute multiple threads in parallel. This article points out where you need to take action to make use of these cores.
The rest of this article will assume the application is not CPU-bound.
Depending on language and web framework, there is typically a direct correspondence between RAM and concurrency.
For example, web servers like Unicorn for Ruby, or Gunicorn for Python, pre-fork a number of identical copies of your web servers (called workers). Unicorn then has its own connection queue, and as workers finish a web request, they pull a new request off of the queue.
Having more RAM in this scenario means that you can have more workers running concurrently - and there is typically a fairly linear correlation between RAM and concurrency. Optimizing concurrency for RAM is something this article addresses.
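As a rough illustration of that linear relationship, the sketch below estimates how many pre-forked workers fit in a dyno. The function name `max_workers`, the 120MB per-worker footprint used in the usage note, and the 20% headroom are hypothetical values chosen for the example; measure your own application to get real figures.

```python
def max_workers(dyno_ram_mb: float, worker_rss_mb: float, headroom: float = 0.2) -> int:
    """Estimate how many forked workers fit in a dyno's RAM.

    headroom is the fraction of RAM deliberately left unused
    (an assumed safety margin, not a Heroku recommendation).
    """
    if worker_rss_mb <= 0:
        raise ValueError("worker_rss_mb must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_mb = dyno_ram_mb * (1 - headroom)
    # Always run at least one worker, even if it does not fit comfortably.
    return max(1, int(usable_mb // worker_rss_mb))
```

With these assumed numbers, `max_workers(512, 120)` suggests 3 workers, and doubling the RAM to 1024MB doubles the estimate to 6. Treat the result as a starting point to measure against, not a guarantee.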
The performance profile of each dyno type can have an impact. In particular, standard-2x dynos operate on a CPU-share basis, whereas performance dynos are single tenant and therefore offer a higher level of resource isolation.
This can have a significant impact on applications, depending on the amount of traffic that they’re receiving and how well they’re optimized. In particular, a more consistent performance profile can lead to reduced tail latencies.
When to try a different dyno size
There are many factors that come into play when considering different dyno types. Some of them are inherent to your application (how much CPU it uses), some are due to optimization factors introduced by increased concurrency (due to having more RAM), and some are due to the inherent characteristics of the dyno itself.
This complexity can be difficult to navigate, but the simple techniques suggested in this article for applications that are not CPU-bound make it much more tractable to optimize for any dyno type.
Once you have optimized for a particular dyno type, say standard-1x dynos, apply the same techniques on a performance-l dyno, taking into account the factors that each dyno type introduces.
Here are some rough rules of thumb:
- For most applications that aren’t receiving tremendously high volumes of traffic, consider
- If the application is particularly memory-hungry, as seen in some Java-based frameworks such as Play and JRuby, consider standard-2x dynos, which double the memory.
- For very high volume web apps, running on more than 20
Basic methodology for optimizing memory
We suggest that you follow these steps, making use of visibility tools listed below, as well as the per-language suggestions. This will get you to a point where you can easily optimize for a single dyno type, or for moving between dyno types.
- Use a concurrent web server.
- Set up instrumentation to measure the impact of load on the app.
- Observe the app’s performance, and adjust the concurrency as necessary.
Optimizing is an iterative process - there is no golden path. Different languages, web frameworks and applications behave quite differently.
For example, a standard Ruby application may need to use a web server that forks multiple copies of an application to make use of all the RAM that is available. A standard Java application, on the other hand, may simply need a parameter to the JVM in order to allocate a larger heap.
Concurrent web servers
Different languages and platforms have different approaches to concurrency. Here’s a brief look at how to establish concurrency in apps running on Ruby, Java, Python and Node.js.
To see how you can optimize your application please refer to the comprehensive R14 - Memory Quota Exceeded in Ruby (MRI) article. It covers common problems for memory bloat in a Ruby application as well as several diagnostic tools and techniques for finding and correcting increased memory use in a Ruby application. Concurrency and Database Connections in Ruby with ActiveRecord is a great resource for evaluating how to factor in best practices for database connections to maximize concurrency, too.
JRuby servers like Puma make good use of concurrency without the need for multiple processes. However, you will need to tune the amount of memory allocated to the JVM, depending on the dyno type. The Ruby buildpack defines sensible defaults, which can be overridden by setting either
Java, Scala, Clojure
Java web servers like Jetty, Tomcat and Netty make good use of concurrency out of the box. However, you will need to tune the amount of memory allocated to the JVM, depending on the dyno type.
Read Adjusting Environment for a Dyno Size for appropriate JAVA_OPTS flags to accomplish this.
If you want to optimize for increased concurrency, Heroku recommends that you use Gunicorn for Python apps.
Gunicorn works by forking a configurable number of child processes, called workers. Each worker can only process a single request at a time. Concurrency comes about because the master Gunicorn process queues new web requests, and these are then delegated to workers if they are available and have completed processing a previous request.
Increasing the concurrency is then configured by increasing the number of workers.
However, as each worker is effectively a forked version of your application, moving from a single worker to two workers will roughly double the memory requirements of your application.
It’s this trade-off, between increased concurrency and the memory available in a dyno, that you will measure and tune.
Read Deploying Python Applications with Gunicorn to learn how to set up Gunicorn for Python on Heroku. This will result in a web server with a config var, WEB_CONCURRENCY, which lets you adjust the number of workers the main Gunicorn process forks.
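As a minimal sketch of how a Gunicorn configuration file might pick up that config var, the example below reads WEB_CONCURRENCY from the environment. The file name `gunicorn.conf.py`, the helper `worker_count`, and the fallback of 2 workers are assumptions made for illustration, not taken from the Heroku guide.

```python
# gunicorn.conf.py - an illustrative sketch, not Heroku's official configuration.
import os


def worker_count(default: int = 2) -> int:
    """Read WEB_CONCURRENCY, falling back to an assumed default."""
    raw = os.environ.get("WEB_CONCURRENCY", "")
    try:
        count = int(raw)
    except ValueError:
        # Unset or non-numeric value: use the fallback.
        count = default
    # Gunicorn needs at least one worker to serve requests.
    return max(1, count)


# Gunicorn reads this module-level name as its `workers` setting.
workers = worker_count()
```

You could then start the server with something like `gunicorn -c gunicorn.conf.py myapp.wsgi` (where `myapp.wsgi` is a placeholder for your own module) and change concurrency without a code change by running `heroku config:set WEB_CONCURRENCY=3`.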
While highly app-dependent, the following table lists some rough rules of thumb for how many Gunicorn workers can be run on each dyno type:
|Dyno Type|Number of Gunicorn workers|
|eco, basic, standard-1x|2-3|
These are just estimates, and will vary from app to app. Use something in the lower range, measure, and adjust as necessary.
For Django-specific recommendations, see Concurrency and Database Connections in Django.
Node offers a single-threaded, non-blocking process model. To take advantage of multiple cores, Node must use the Cluster API to fork multiple concurrent processes. Even if you don’t plan on using concurrency today, we recommend enabling Cluster in your app so that it can scale to a variety of containers.
Read Optimizing Node.js Concurrency to learn how to configure concurrency through Node’s Cluster API on Heroku.
Applications using the PHP or HHVM runtimes automatically adjust their number of worker processes or threads depending on the type of dyno they run on. The main factor to decide the number of processes or threads is the PHP memory limit that’s configured for an application.
Please refer to Optimizing PHP Application Concurrency for more information on tuning PHP applications for maximum throughput.
After setting up a concurrent web server, you’ll want to tune it for a particular dyno type. Measuring memory and throughput should provide enough guidance for you to make a judgement as to the impact of a change.
Measuring memory with log-runtime-metrics
The Heroku Labs log-runtime-metrics feature provides visibility into load and memory usage for running dynos.
Per-dyno stats on memory use, swap use, and load average are inserted into the app’s log stream.
Here is some example output with this feature enabled:
source=web.1 dyno=heroku.2808254.d97d0ea7-cf3d-411b-b453-d2943a50b456 sample#load_avg_1m=2.46 sample#load_avg_5m=1.06 sample#load_avg_15m=0.99
source=web.1 dyno=heroku.2808254.d97d0ea7-cf3d-411b-b453-d2943a50b456 sample#memory_total=21.00MB sample#memory_rss=21.22MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=348836pages sample#memory_pgpgout=343403pages
memory_rss is the most significant number here, providing an indication of total resident memory. Ensure that you don’t exceed the memory of your dyno type, and leave some headroom too. Likewise, keep swap usage to a minimum and make sure swapping activity (memory_pgpgout) is minimal. Ideally, memory_pgpgout shouldn’t change much over time (its rate of change should be close to zero).
See log-runtime-metrics to understand how to interpret these figures.
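If you want to check these figures programmatically, the following sketch pulls the `sample#` values out of a log line and reports the remaining headroom. The function names and the regular expression are hypothetical, written against the example output above rather than a documented log format specification.

```python
import re

# Matches pairs such as "sample#memory_rss=21.22MB" or "sample#load_avg_1m=2.46".
SAMPLE_RE = re.compile(r"sample#(\w+)=([0-9.]+)([A-Za-z]*)")


def parse_samples(line: str) -> dict:
    """Return {metric_name: numeric_value} for every sample# pair in a log line."""
    return {name: float(value) for name, value, _unit in SAMPLE_RE.findall(line)}


def memory_headroom_mb(line: str, dyno_ram_mb: float) -> float:
    """Return how many MB remain between memory_rss and the dyno's RAM."""
    samples = parse_samples(line)
    return dyno_ram_mb - samples["memory_rss"]
```

For the memory line shown above and a 512MB dyno, `memory_headroom_mb` reports roughly 490MB of headroom. Comparing `memory_pgpgout` between two consecutive lines from the same dyno gives the rate of change mentioned above.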
The output of log-runtime-metrics is particularly useful as it lets you look at per-dyno memory usage. If you’re over-provisioned, you may see a single dyno peaking before any other.
There are other ways of visualizing this memory data:
The Librato add-on, with the Nickel plan and above, provides a way to graph the various output from log-runtime-metrics, averaging the values across all the dynos.
Here is sample output for a Rails application on standard-1x dynos using 4 Unicorn workers. The memory, about 359MB at peak, fits comfortably into the 512MB of RAM on a standard-1x dyno.
Measuring throughput and response time
Throughput, the number of requests being handled per minute, as well as response times, are particularly useful indicators of how an optimization has affected the performance of a dyno.
In particular, the 95th and 99th percentile response time values provided by add-ons like Librato or New Relic should be monitored closely.
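To make the percentile figures concrete, here is a small sketch that computes them from a list of response times using the nearest-rank method. The function name and the sample data are hypothetical, and monitoring add-ons may use a different percentile definition (for example, interpolation), so their numbers can differ slightly.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based; convert to a 0-based index when reading the list.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
response_times_ms = [120, 80, 95, 300, 110]
p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)
```

In this made-up sample, the median is 110ms while the 99th percentile is 300ms, which shows how a single slow request dominates the tail even when typical requests are fast.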