Get pdf-attachments from Gmail as text

Edit: Updated for DriveApp, as DocsList is deprecated.


I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email; the second is how to convert that pdf to text.

As you’ve found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.

First, we’ll get the attachment as a Blob, a utility class used by several Services to exchange data.

var blob = attachments[0].getAs(MimeType.PDF);

So with the second problem separated out, and maintaining the assumption that we’re interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:

/**
 * Get messages labeled 'templabel', and send myself the text contents of
 * pdf attachments in new emails.
 */
function myFunction() {

  var threads = GmailApp.search('label:templabel');
  var threadsMessages = GmailApp.getMessagesForThreads(threads);

  for (var thread = 0; thread < threadsMessages.length; ++thread) {
    var message = threadsMessages[thread][0];
    var messageBody = message.getBody();
    var messageSubject = message.getSubject();
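    var attachments = message.getAttachments();

    // Skip messages without attachments; attachments[0] would be undefined.
    if (attachments.length === 0) continue;

    var blob = attachments[0].getAs(MimeType.PDF);
    var filetext = pdfToText( blob, {keepTextfile: false} );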

    GmailApp.sendEmail(Session.getActiveUser().getEmail(), messageSubject, filetext);
  }
}

We’re relying on a helper function, pdfToText(), to convert our pdf blob into text, which we’ll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we’ve elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
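myFunction() assumes that the first attachment of each message is a pdf. If that does not hold for your mail, a small guard can pick the first pdf instead. The following is only a sketch: firstPdfAttachment() and fakeAttachment() are hypothetical names, not part of the gist, and the stand-in objects merely mimic the getContentType() and getName() methods that Gmail attachments expose.

```javascript
/**
 * Return the first attachment whose content type is pdf, or null if none.
 * Hypothetical helper; accepts any objects that expose getContentType().
 */
function firstPdfAttachment(attachments) {
  for (var i = 0; i < attachments.length; i++) {
    if (attachments[i].getContentType() === 'application/pdf') {
      return attachments[i];
    }
  }
  return null;
}

// Stand-in for a Gmail attachment, for trying the helper outside Apps Script.
function fakeAttachment(name, contentType) {
  return {
    getName: function () { return name; },
    getContentType: function () { return contentType; }
  };
}

var sample = [
  fakeAttachment('logo.png', 'image/png'),
  fakeAttachment('invoice.pdf', 'application/pdf')
];
var picked = firstPdfAttachment(sample);
```

In myFunction(), the result of message.getAttachments() would be passed in place of sample, and a null result would mean the message is skipped.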

pdfToText()

This utility is available as a gist. Several examples are provided there.

A previous answer indicated that it was possible to use the Drive API’s insert method to perform OCR, but it didn’t provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to enable the Drive API from the editor, under Resources > Advanced Google Services.

pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, this contains the “pictures” of each page in the document, and there is not much we can do about that. It then uses the regular Document service (DocumentApp) to extract the document body as plain text.

/**
 * See gist: https://gist.github.com/mogsdad/e6795e438615d252584f
 *
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 *   keepPdf (boolean, default false)     Keep a copy of the original PDF file.
 *   keepGdoc (boolean, default false)    Keep a copy of the OCR Google Doc file.
 *   keepTextfile (boolean, default true) Keep a copy of the text file.
 *   path (string, default blank)         Folder path to store file(s) in.
 *   ocrLanguage (ISO 639-1 code)         Default 'en'.
 *   textResult (boolean, default false)  If true and keepTextfile true, return
 *                                        string of text content. If keepTextfile
 *                                        is false, text content is returned without
 *                                        regard to this option. Otherwise, return
 *                                        id of textfile.
 *
 * @param {blob}   pdfFile    Blob containing pdf file
 * @param {object} options    (Optional) Object specifying handling details
 *
 * @returns {string}          id of text file (default) or text content
 */
function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
  }

  // Set default options
  options = options || {};
  options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;

  // Prepare resource object for file creation
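  var parents = [];
  if (options.path) {
    // The Drive API expects parent references of the form {id: ...},
    // not DriveApp Folder objects. Skip the parent if the path is not found.
    var folder = getDriveFolderFromPath (options.path);
    if (folder) {
      parents.push( {id: folder.getId()} );
    }
  }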
  var pdfName = pdfFile.getName();
  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };

  // Save PDF to Drive, if requested
  if (options.keepPdf) {
    var file = Drive.Files.insert(resource, pdfFile);
  }

  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    ocrLanguage: options.ocrLanguage || 'en'
  };
  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);

  // Get text from GDOC  
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();

  // We're done using the Gdoc. Unless requested to keepGdoc, delete it.
  if (!options.keepGdoc) {
    Drive.Files.remove(gdocFile.id);
  }

  // Save text file, if requested
  if (options.keepTextfile) {
    resource.title = pdfName.replace(/pdf$/, 'txt');
    resource.mimeType = MimeType.PLAIN_TEXT;

    var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
    var textFile = Drive.Files.insert(resource, textBlob);
  }

  // Return result of conversion
  if (!options.keepTextfile || options.textResult) {
    return text;
  }
  else {
    return textFile.id;
  }
}
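The interplay between keepTextfile and textResult is easy to misread, so here is a minimal sketch of just the return-value decision. describePdfToTextResult() is a hypothetical name used for illustration; it repeats the default handling and the final condition from pdfToText() without touching Drive.

```javascript
/**
 * Report what pdfToText() would return for a given options object:
 * 'text' for the text content, 'fileId' for the id of the saved text file.
 * Hypothetical helper that only mirrors the option logic.
 */
function describePdfToTextResult(options) {
  options = options || {};
  // keepTextfile defaults to true when the caller does not set it.
  var keepTextfile = Object.prototype.hasOwnProperty.call(options, 'keepTextfile')
      ? options.keepTextfile
      : true;
  return (!keepTextfile || options.textResult) ? 'text' : 'fileId';
}

var byDefault   = describePdfToTextResult();                      // 'fileId'
var noTextfile  = describePdfToTextResult({keepTextfile: false}); // 'text'
var keepAndText = describePdfToTextResult({textResult: true});    // 'text'
```

So the call in myFunction(), with keepTextfile: false, gets the text back directly, while a call with no options gets the id of the text file saved on Drive.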

The conversion to DriveApp relies on this utility from Bruce McPherson, which walks a folder path starting at the root folder:

// From: http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
  return (path || "/").split("/").reduce ( function(prev,current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    else { 
      return current ? null : prev; 
    }
  },DriveApp.getRootFolder()); 
}
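To see how the reduce walks a path, here is a sketch that runs the same logic against an in-memory folder tree instead of Drive. makeFolder() and resolvePath() are hypothetical stand-ins; makeFolder() only imitates the getFoldersByName() iterator (hasNext/next) that the utility relies on.

```javascript
// Build a fake folder with the one method the path walk needs.
function makeFolder(name, children) {
  children = children || [];
  return {
    getName: function () { return name; },
    getFoldersByName: function (wanted) {
      var matches = children.filter(function (c) { return c.getName() === wanted; });
      var i = 0;
      return {
        hasNext: function () { return i < matches.length; },
        next: function () { return matches[i++]; }
      };
    }
  };
}

// Same reduce as getDriveFolderFromPath(), but starting from a supplied root.
function resolvePath(root, path) {
  return (path || "/").split("/").reduce(function (prev, current) {
    if (prev && current) {
      var fldrs = prev.getFoldersByName(current);
      return fldrs.hasNext() ? fldrs.next() : null;
    }
    // Empty segments (leading or trailing slash) keep the current folder;
    // a named segment after a failed lookup stays null.
    return current ? null : prev;
  }, root);
}

var root = makeFolder('root', [
  makeFolder('Reports', [ makeFolder('2014') ])
]);
var found   = resolvePath(root, '/Reports/2014'); // the '2014' folder
var missing = resolvePath(root, '/Nope/2014');    // null
```

A path that cannot be resolved yields null, so callers should check the result before using it.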
