This article was contributed by Jonathan Hoyt
Processing PDFs with Ruby and MongoDB
Last updated 07 May 2015
Processing PDFs involves several long-running tasks that must operate independently of each other to form a single system that behaves in a scalable and reliable manner.
This article outlines the steps necessary to create the components of a PDF storage, search and retrieval system in Ruby that lets users upload PDFs, stores them in Amazon S3, generates thumbnail images and provides basic searching of the PDFs.
The application processes PDF files asynchronously in the background, which lets it scale horizontally and gives granular control over the compute resources assigned to each component.
The source for the reference application is available on GitHub at https://github.com/rwdaigle/demo-cedar-pdfarchive.
If you have questions about Ruby on Heroku, consider discussing it in the Ruby on Heroku forums.
Processing PDFs involves several system-level dependencies. To run the reference application in a local environment, first install the Ghostscript, ImageMagick, Xpdf and MongoDB dependencies.
On a Mac, use Homebrew to install the system dependencies:
$ brew install gs imagemagick xpdf mongo
Clone and run the app locally using Foreman:
$ git clone git://github.com/rwdaigle/demo-cedar-pdfarchive.git
Cloning into demo-cedar-pdfarchive...
...
Resolving deltas: 100% (90/90), done.
$ cd demo-cedar-pdfarchive
$ bundle install
Fetching git://github.com/jnicklas/carrierwave.git
...
Your bundle is complete!
$ foreman start
16:12:37 web.1 | started with pid 59893
16:12:37 worker.1 | started with pid 59895
The app should now be running at http://localhost:5000.
Deploy to Heroku
Create a Heroku application and set its RACK_ENV config var to production:
$ heroku create my-pdfarchive
Creating my-pdfarchive... done, stack is cedar-14
http://my-pdfarchive.herokuapp.com/ | firstname.lastname@example.org:my-pdfarchive.git
Git remote heroku added
$ heroku config:set RACK_ENV=production
Adding config vars: RACK_ENV => production
Restarting app... done, v1.
Provision a free MongoDB add-on. This example uses mongohq:sandbox; mongolab:starter is also available as a free MongoDB add-on.
$ heroku addons:create mongohq:sandbox
Create the Amazon S3 bucket before running the application, then provide the app with your AWS credentials and the storage bucket for PDFs and preview images via config vars:
$ heroku config:set AWS_ACCESS_KEY_ID=YOUR_AWS_KEY \
    AWS_SECRET_ACCESS_KEY=YOUR_ACCESS_KEY \
    BUCKET_NAME=PDFARCHIVE_BUCKET_NAME
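Inside the application, these config vars surface as environment variables. The following is a minimal sketch of reading them at boot. The helper name s3_settings and the hash keys it returns are assumptions made for illustration and are not taken from the reference application.

```ruby
# Hypothetical helper: collects the S3-related config vars set above.
# fetch raises KeyError when a variable is missing, so a misconfigured
# app fails at boot instead of partway through an upload.
def s3_settings(env = ENV)
  {
    access_key_id:     env.fetch('AWS_ACCESS_KEY_ID'),
    secret_access_key: env.fetch('AWS_SECRET_ACCESS_KEY'),
    bucket:            env.fetch('BUCKET_NAME')
  }
end
```

Passing the environment in as an argument keeps the helper easy to exercise with a plain hash in tests.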
Then deploy the application to Heroku, scale its processes and open it in a browser:
$ git push heroku master
$ heroku ps:scale web=1 worker=1
$ heroku open
Building a decoupled system requires several logically separated components, including domain objects and file processing and uploading classes.
CarrierWave is a library for handling the receipt and processing of uploads and greatly simplifies the code necessary to support file uploading.
PdfUploader is the PDF Archive’s CarrierWave uploader class that is responsible for storing the uploaded PDF and for the logic that retrieves the preview image from the PDF file itself.
The Grim gem is used to easily extract the first page of the document as the preview image.
Setting Grim::WIDTH = 100 tells Grim to extract pages as images with a width of 100 pixels.
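class PdfUploader < Uploader
  # Render extracted pages at a width of 100 pixels.
  Grim::WIDTH = 100

  # Returns a Grim handle on the locally cached copy of the PDF.
  def grim
    cache_stored_file! unless cached?
    Grim.reap(cache_path)
  end

  # Saves the first page as a JPEG and returns the image's path.
  def create_preview
    cache_stored_file! unless cached?
    output_path = File.join(cache_dir, 'preview.jpg')
    grim[0].save(output_path)
    output_path
  end
end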
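Because only the width is fixed at 100 pixels, the height of each preview depends on the page's proportions. The sketch below estimates that height, assuming the aspect ratio is preserved when the page is resized. The method name preview_height is hypothetical, and the exact rounding applied by the underlying image tools may differ.

```ruby
# Approximate height of a preview scaled to a fixed width while
# keeping the page's aspect ratio. Illustrative arithmetic only.
def preview_height(page_width, page_height, target_width = 100)
  (page_height.to_f * target_width / page_width).round
end

preview_height(612, 792) # a US Letter page measured in points
```

A portrait page therefore produces a preview taller than it is wide, which is worth keeping in mind when laying out thumbnails.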
The Uploader parent class contains shared uploader logic, including the storage provider to use and the temporary location that holds the uploaded file before it is sent to S3.
Preview image uploader
The preview image extracted from the PDF using Grim also needs to be stored in Amazon S3. The PreviewStore CarrierWave uploader class handles that function in the same fashion as PdfUploader and inherits all of its behavior from the Uploader parent class.
The domain model representing the PDF document itself is called Document and is responsible for assigning the CarrierWave uploaders to the appropriate attributes. This example uses the MongoMapper ORM to handle persistence of the Document model to a MongoDB database.
Hunt adds the searches class method to enable searching on a set of given attributes.
class Document
  include MongoMapper::Document
  plugin Hunt

  key :page_contents, Array

  mount_uploader :pdf, PdfUploader
  mount_uploader :preview, PreviewStore

  searches :pdf_filename, :page_contents
end
Mounting the PdfUploader and PreviewStore CarrierWave classes to the pdf and preview attributes automatically enables storage of those files to S3 on attribute assignment.
Uploading a PDF
For simplicity, this example app has a single action, mapped to the root route /, that receives the binary file upload. An instance of the Document class is created and a new background job is enqueued to process the upload.
post '/' do
  if params['pdf']
    document = Document.create!(params)
    Qu.enqueue(ProcessPdf, document.id)
  end
  erb :home
end
It’s important to note that the heavy lifting of processing the PDF does not occur within this action. Doing so would tie up the web process for an inordinate amount of time and result in unpredictable behavior under heavy load. These long-running tasks are best kept outside the request/response lifecycle in a background job.
In this example the Qu library is used to queue the ProcessPdf task for preview image and content extraction. Invoking Qu.enqueue places a job on the queue, which is picked up and invoked in a background worker.
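The enqueue and perform contract can be illustrated with a toy in-process queue. This is only a sketch of the pattern and not how Qu is implemented: Qu hands jobs to separate worker processes, whereas the hypothetical ToyQueue and RecordId classes below run in a single Ruby process.

```ruby
require 'thread' # provides Queue on older Rubies; harmless on newer ones

# A toy in-process queue that mimics the enqueue/perform contract.
class ToyQueue
  def initialize
    @jobs = Queue.new # thread-safe FIFO
  end

  # Record the job class and its arguments, then return immediately.
  def enqueue(job_class, *args)
    @jobs << [job_class, args]
  end

  # What a worker does: take the next job and call perform on its class.
  def work_one
    job_class, args = @jobs.pop
    job_class.perform(*args)
  end

  def size
    @jobs.size
  end
end

# Hypothetical job with the same shape as ProcessPdf: a class-level
# perform method that receives the enqueued arguments.
class RecordId
  def self.processed
    @processed ||= []
  end

  def self.perform(document_id)
    processed << document_id
  end
end

queue = ToyQueue.new
queue.enqueue(RecordId, 42) # returns at once; nothing has run yet
queue.work_one              # the "worker" runs the job later
```

The point of the pattern is that the caller of enqueue only pays the cost of recording the job, while the expensive work happens whenever a worker gets to it.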
The Procfile declares the web process and the background worker, which in this case processes items that have been enqueued:
web: bundle exec ruby lib/pdf_archive.rb -p $PORT
worker: bundle exec rake qu:work QUEUE=default
For the application to run properly, at least one web process is required to serve the web interface and receive file uploads, and at least one worker process is required to perform the time-consuming background processing.
It is quite common to scale the front-end and background process types asymmetrically, as the demands of serving a simple HTML page are quite different from those of processing binary files such as PDFs. Scaling the worker process type individually allows for more rapid processing of the worker queue should a backlog accumulate.
$ heroku ps:scale worker+2
The task of extracting a preview image and the text content from the PDF file occurs within the ProcessPdf class:
class ProcessPdf
  def self.perform(document_id)
    document = Document.find!(document_id)
    pdf = document.pdf
    if pdf.grim.count > 0
      document.preview = File.open(pdf.create_preview)
      pdf.grim.each do |page|
        document.page_contents << page.text
      end
      document.save!
    else
      raise 'PDF has no content'
    end
  end
end
Grim, which was added to the PdfUploader class, can be leveraged to sequentially extract the textual contents of each page of the PDF. In this case the contents are stored in a page_contents array attribute that is searchable by way of Hunt within the Document class.
After the background job runs, the file’s contents will be searchable and its preview image will be stored in Amazon S3, completing the full lifecycle of receiving and processing a PDF.
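To make the idea of searching page_contents concrete, the sketch below performs a naive term match over per-page text in plain Ruby. It is illustrative only: the method names search_terms and matches? are hypothetical, and Hunt's actual indexing and matching rules may differ.

```ruby
# Break text into lowercase alphanumeric terms, without duplicates.
def search_terms(text)
  text.downcase.scan(/[a-z0-9]+/).uniq
end

# Returns true when every term in the query appears somewhere in the pages.
def matches?(page_contents, query)
  indexed = page_contents.flat_map { |page| search_terms(page) }.uniq
  (search_terms(query) - indexed).empty?
end

pages = ['Quarterly Report', 'Revenue grew in the third quarter.']
matches?(pages, 'revenue report') # => true
matches?(pages, 'annual report')  # => false
```

Storing one string per page, as ProcessPdf does, keeps this kind of term extraction simple because each array element can be indexed independently.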
System binaries such as pdftotext (part of xpdf) that are required for running the PDF Archive application locally are also required in production but are not available by default on Heroku. These binaries are included in the application source in the bin directory so they can be executed by the application on Heroku.
Processing PDFs involves several discrete steps, each of which has different performance characteristics. Uploading of a document occurs within a user’s request/response lifecycle and is kept as performant and predictable as possible to achieve ideal end-user response times. The processing of the PDF contents occurs within a background job to offload the most time-consuming portion of the full file lifecycle and provide the ability to more granularly scale individual aspects of the application.
The PDF Archive reference application can be seen running on Heroku at http://demo-cedar-pdfarchive.herokuapp.com.