Jonathan Hoyt

This article was contributed by Jonathan Hoyt

Hoyt works at GitHub where he helps make Speaker Deck and Gauges awesome. He also writes about hacking life with a servant's heart at The Programming Butler.

Processing PDFs with Ruby and MongoDB

Last Updated: 22 October 2013

Processing PDFs involves several long-running tasks that must operate independently of each other yet form a single system that behaves in a scalable and reliable manner.

This article outlines the steps necessary to create the components for a PDF storage, search and retrieval system in Ruby that lets users upload PDFs, stores those in Amazon S3, generates thumbnail images and provides for basic searching of the PDFs.

PDF Archiver screenshot

The application processes PDF files asynchronously in the background, which lets it scale horizontally and provides granular control over the compute resources assigned to each component.

The source for the reference application is available on GitHub at https://github.com/rwdaigle/demo-cedar-pdfarchive.

If you have questions about Ruby on Heroku, consider discussing them in the Ruby on Heroku forums.

Reference application

Local deployment

Processing PDFs involves several system-level dependencies. To run the reference application in a local environment first install the Ghostscript, ImageMagick, Xpdf and MongoDB dependencies:

On a Mac, use Homebrew to install the system dependencies.

$ brew install gs imagemagick xpdf mongo

Clone and run the app locally using Foreman:

$ git clone git://github.com/rwdaigle/demo-cedar-pdfarchive.git
Cloning into demo-cedar-pdfarchive...
...
Resolving deltas: 100% (90/90), done.
$ cd demo-cedar-pdfarchive
$ bundle install
Fetching git://github.com/jnicklas/carrierwave.git
...
Your bundle is complete!
$ foreman start
16:12:37 web.1     | started with pid 59893
16:12:37 worker.1  | started with pid 59895

The app should now be running at http://localhost:5000.

Deploy to Heroku

Create a Heroku application and set it as the production environment.

$ heroku create my-pdfarchive
Creating my-pdfarchive... done, stack is cedar
 http://my-pdfarchive.herokuapp.com/ | git@heroku.com:my-pdfarchive.git
Git remote heroku added
$ heroku config:set RACK_ENV=production
Adding config vars:
  RACK_ENV => production
Restarting app... done, v1.

Provision a free MongoDB add-on.

mongolab:starter is also available as a free MongoDB add-on.

$ heroku addons:add mongohq:sandbox

Provide the app with your AWS credentials and storage bucket for PDFs and preview images via config vars:

The Amazon S3 bucket should be created before running the application.

$ heroku config:set AWS_ACCESS_KEY_ID=YOUR_AWS_KEY \
                    AWS_SECRET_ACCESS_KEY=YOUR_ACCESS_KEY \
                    BUCKET_NAME=PDFARCHIVE_BUCKET_NAME
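
At runtime, config vars reach the application as environment variables. The following is a minimal sketch of reading them, not code from the reference application: the variable names match the config vars set above, while the storage_config method, the REQUIRED_STORAGE_VARS constant and the shape of the returned hash are assumptions made for illustration.

```ruby
# Hypothetical helper (not part of the reference application).
# The variable names match the config vars set with `heroku config:set`.
REQUIRED_STORAGE_VARS = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY BUCKET_NAME].freeze

# Reads the storage settings from the given environment (ENV by default)
# and fails fast, naming every missing or empty variable.
def storage_config(env = ENV)
  missing = REQUIRED_STORAGE_VARS.select { |name| env[name].to_s.empty? }
  raise KeyError, "missing config vars: #{missing.join(', ')}" unless missing.empty?

  {
    access_key_id: env['AWS_ACCESS_KEY_ID'],
    secret_access_key: env['AWS_SECRET_ACCESS_KEY'],
    bucket: env['BUCKET_NAME']
  }
end
```

Failing at boot when a variable is absent tends to be easier to diagnose than a storage error raised later inside a background job.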

Then deploy the application to Heroku.

$ git push heroku master
$ heroku ps:scale web=1 worker=1
$ heroku open

Object model

Building a decoupled system requires several logically separated components, including domain object, file processing and uploading classes.

Document uploader

CarrierWave is a library for handling the receipt and processing of uploads and greatly simplifies the code necessary to support file uploading. PdfUploader is the PDF Archive’s CarrierWave uploader class that is responsible for storing the uploaded PDF and for the logic that retrieves the preview image from the PDF file itself.

The Grim gem is used to easily extract the first page of the document as the preview image. Grim::WIDTH = 100 tells Grim to extract pages as images with width of 100 pixels.

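class PdfUploader < Uploader
  # Render extracted pages as images 100 pixels wide.
  Grim::WIDTH = 100

  # Returns a Grim PDF object for the locally cached copy of the upload.
  def grim
    cache_stored_file! unless cached?
    Grim.reap(cache_path)
  end

  # Saves the first page as a JPEG in the cache directory and returns its path.
  def create_preview
    cache_stored_file! unless cached?
    output_path = File.join(cache_dir, 'preview.jpg')
    grim[0].save(output_path)
    output_path
  end
end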

The Uploader parent class contains shared uploader logic including the storage provider to use as well as the temporary file location to store the uploaded file before uploading to S3.

Preview image uploader

The preview image extracted from the PDF using Grim also needs to be stored in Amazon S3. The PreviewStore CarrierWave uploader class handles that function in the same fashion as the PdfUploader and inherits all of its behavior from the Uploader parent class.

Document class

The domain model representing the PDF document itself is called Document and is responsible for assigning the CarrierWave uploaders to the appropriate attributes. This example uses the MongoMapper ORM to handle persistence of the Document model to a MongoDB database.

Hunt adds the searches class method to enable searching on a set of given attributes.

class Document
  include MongoMapper::Document
  plugin Hunt

  key :page_contents, Array
  mount_uploader :pdf, PdfUploader
  mount_uploader :preview, PreviewStore
  searches :pdf_filename, :page_contents
end

Mapping the PdfUploader and PreviewStore CarrierWave classes to the pdf and preview attributes automatically enables storage of those files to S3 on attribute assignment.

Uploading a PDF

Upload action

For simplicity, this example app handles the binary file upload in a single action mapped to the root route /. The action creates an instance of the Document class and enqueues a new background job to process the upload.

post '/' do
  if params['pdf']
    document = Document.create!(params)
    Qu.enqueue(ProcessPdf, document.id)
  end

  erb :home
end

Background processing

It’s important to note that the heavy lifting of processing the PDF does not occur within this action. Doing so would tie up the web process for an inordinate amount of time and result in unpredictable behavior under heavy load. These long-running tasks are best kept outside the request/response lifecycle in a background job.

Qu is a Resque API-compliant queuing library that supports multiple backends. This application uses the same Mongo instance for queuing support as for object persistence.

In this example the Qu library is used to queue the ProcessPdf task for preview image and content extraction. Invoking Qu.enqueue places a job on the queue which is picked up and invoked in a background worker.
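
The enqueue/perform contract described above can be illustrated without any queuing library. The following is a minimal in-memory sketch, not Qu's implementation: TinyQueue and EchoJob are hypothetical names, and a real backend such as MongoDB would persist the payload so that a separate worker process could pick it up.

```ruby
require 'json'

# Hypothetical in-memory stand-in for a job queue (not Qu itself).
class TinyQueue
  def initialize
    @jobs = Queue.new # thread-safe FIFO from Ruby's core library
  end

  # Stores only the class name and plain arguments, serialized as JSON,
  # so the payload could be handed to another process.
  def enqueue(job_class, *args)
    @jobs << JSON.generate('class' => job_class.name, 'args' => args)
  end

  # Pops one payload, resolves the job class and calls its perform method.
  # Returns false when there is nothing to do.
  def work_one
    return false if @jobs.empty?

    payload = JSON.parse(@jobs.pop)
    Object.const_get(payload['class']).perform(*payload['args'])
    true
  end
end

# Hypothetical job class following the same contract as ProcessPdf:
# a class-level perform method that receives the enqueued arguments.
class EchoJob
  @processed = []

  class << self
    attr_reader :processed

    def perform(document_id)
      @processed << document_id
    end
  end
end

queue = TinyQueue.new
queue.enqueue(EchoJob, 'doc-1') # what the web process does
queue.work_one                  # what the worker process does
```

Because only the class name and an identifier travel through the queue, the worker has to reload the record itself, which is why a job of this kind typically starts by looking up the document by id.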

The application’s Procfile declares the background worker, which in this case processes items that have been enqueued:

web: bundle exec ruby lib/pdf_archive.rb -p $PORT
worker: bundle exec rake qu:work QUEUE=default

For the application to run properly, it needs at least one web process to serve the web interface and receive file uploads, and at least one worker process to perform the time-consuming background processing.

It is quite common to asymmetrically scale the front-end and background process types as the demands of serving a simple HTML page are quite different from processing binary files such as PDFs. Scaling the worker process type individually will allow for more rapid processing of the worker queue should a backlog accumulate.

$ heroku ps:scale worker+2
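
The effect of adding workers can be sketched in plain Ruby, with threads standing in for worker processes that share one backlog. This is an illustration under stated assumptions rather than how Heroku or Qu schedule work; the drain method and its parameters are hypothetical.

```ruby
# Hypothetical illustration: threads stand in for worker processes.
# Every worker pulls from the same backlog until it sees a stop marker.
def drain(backlog, worker_count, &job)
  queue = Queue.new
  backlog.each { |item| queue << item }
  worker_count.times { queue << :stop } # one stop marker per worker

  results = Queue.new
  workers = Array.new(worker_count) do |index|
    Thread.new do
      while (item = queue.pop) != :stop
        # Record which worker handled the item along with the job's result.
        results << [index, job.call(item)]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

# Three workers share a backlog of five documents.
handled = drain(%w[a.pdf b.pdf c.pdf d.pdf e.pdf], 3) { |name| name.upcase }
```

Each item is processed exactly once regardless of the worker count, so the number of workers can change without any change to the code that enqueues jobs.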

PDF processing

The task of extracting a preview image and the text content from the PDF file is performed by the ProcessPdf job class.

class ProcessPdf
  def self.perform(document_id)
    document = Document.find!(document_id)
    pdf = document.pdf

    if pdf.grim.count > 0
      document.preview = File.open(pdf.create_preview)

      pdf.grim.each do |page|
        document.page_contents << page.text
      end

      document.save!
    else
      raise 'PDF has no content'
    end
  end
end

Grim, which was added to the PdfUploader class, can be leveraged to sequentially extract the textual contents of each page of the PDF. In this case the contents are stored in a page_contents array attribute that is searchable by way of Hunt within the Document model.
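
To make the idea of searching page contents concrete, here is a minimal keyword index in plain Ruby. It is not Hunt's implementation and leaves out anything a real search plugin might add; the PageIndex class and its methods are hypothetical and only show how an array of page texts can be turned into searchable terms.

```ruby
# Hypothetical stand-in for keyword search over page contents.
# This is not how Hunt works internally; it only illustrates the idea.
class PageIndex
  def initialize
    # Maps each term to the ids of the documents that contain it.
    @terms = Hash.new { |hash, term| hash[term] = [] }
  end

  # Splits every page into lowercase words and records the document id
  # once per distinct term.
  def add(document_id, page_contents)
    page_contents.each do |text|
      text.downcase.scan(/[a-z0-9]+/).uniq.each do |term|
        @terms[term] << document_id unless @terms[term].include?(document_id)
      end
    end
  end

  # Returns the ids of documents that contain every word in the query.
  def search(query)
    words = query.downcase.scan(/[a-z0-9]+/)
    return [] if words.empty?

    words.map { |word| @terms.fetch(word, []) }.reduce(:&).dup
  end
end

index = PageIndex.new
index.add('doc-1', ['Processing PDFs with Ruby', 'Background jobs'])
index.add('doc-2', ['Ruby on Heroku'])
matches = index.search('ruby heroku')
```

Indexing at write time, as the background job does when it fills page_contents, keeps the search itself a cheap lookup.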

After the background job runs the file’s contents will be searchable and its preview image will be stored in Amazon S3, completing the full lifecycle of receiving and processing a PDF.

Vendor binaries

Follow these instructions to compile binaries directly for use on Heroku. Alternatively, binary dependencies can be specified in gem format.

The gs and pdftotext (part of xpdf) binaries required for running the PDF Archive application locally are also required in production but are not available by default on Heroku. These binaries are included in the application source in the bin directory so they can be executed by the application on Heroku.
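
For a sense of what the application needs from these binaries, the sketch below builds a pdftotext invocation that extracts the text of a single page. It assumes the standard pdftotext options -f and -l (first and last page) and -enc (output encoding), with a trailing - to write to standard output. The helper names and the binary keyword are hypothetical, and this is not a description of how the Grim gem calls the tool.

```ruby
require 'open3'

# Hypothetical helpers (not part of the reference application).
# Builds the argument list for extracting one page of text. Passing the
# arguments as an array avoids shell interpolation of the file path.
def pdftotext_command(pdf_path, page_number, binary: 'pdftotext')
  raise ArgumentError, 'page numbers start at 1' if page_number < 1

  [binary, '-enc', 'UTF-8',
   '-f', page_number.to_s, # first page to convert
   '-l', page_number.to_s, # last page to convert
   pdf_path, '-']          # "-" sends the text to standard output
end

# Runs the command and returns the extracted text, raising if the
# binary exits with a non-zero status.
def page_text(pdf_path, page_number, binary: 'pdftotext')
  output, status = Open3.capture2(*pdftotext_command(pdf_path, page_number, binary: binary))
  raise "pdftotext failed for #{pdf_path}" unless status.success?

  output
end
```

With a vendored binary, the binary argument could point at the copy shipped with the application, for example File.join('bin', 'pdftotext'); that exact path is an assumption based on the bin directory mentioned above.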

Conclusion

Processing PDFs involves several discrete steps, each of which has different performance characteristics. Uploading a document occurs within a user's request/response lifecycle and is kept as fast and predictable as possible to achieve ideal end-user response times. Processing the PDF contents occurs within a background job, which offloads the most time-consuming portion of the file lifecycle and makes it possible to scale individual parts of the application independently.

The PDF Archive reference application can be seen running on Heroku at http://demo-cedar-pdfarchive.herokuapp.com.