JavaScript – Batch converting doc/docx to PDF using JavaScript

doc, docx, java, javascript

I'm working on a Java program that converts .doc and .docx files to PDF. I've tried several approaches, including a number of open-source Java libraries, but these often messed up the layout of the documents.

I then came across a JavaScript script for Windows Script Host that uses an underlying Microsoft Word instance to open the file and save it as a PDF (found at: https://superuser.com/questions/17612/batch-convert-word-documents-to-pdfs-free/28303#28303):

var fso = new ActiveXObject("Scripting.FileSystemObject");
var docPath = WScript.Arguments(0);
var pdfPath = WScript.Arguments(1);
docPath = fso.GetAbsolutePathName(docPath);
var objWord = null;
try{
    WScript.Echo("Saving '" + docPath + "' as '" + pdfPath + "'...");
    objWord = new ActiveXObject("Word.Application");
    objWord.Visible = false;
    var objDoc = objWord.Documents.Open(docPath);
    var wdFormatPdf = 17;
    objDoc.SaveAs(pdfPath, wdFormatPdf);
    objDoc.Close();
    WScript.Echo("The CV was successfully converted.");
} catch(err){
    WScript.Echo("An error occurred: " + err.message);
}finally{
    if (objWord != null){
        objWord.Quit();
    }
}

This script is called synchronously from my Java program, once per document.

On a small scale this works well, but when converting several thousand documents I ran into a couple of problems:

  • Sometimes a Word process would hang at the 'Save As' prompt. The process stayed blocked until a user dismissed the prompt.
  • Sometimes a Word process would hang at a 'Bookmark' prompt. Again, the process stayed blocked until a user dismissed the prompt.

I'm looking for the best/cleanest way to control these Word processes, for example by giving each one a deadline: allow 5 seconds to open the Word document and save it as a PDF, and kill the process if it is still active after that.
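To make the deadline idea concrete, here is a minimal sketch of how the Java side could enforce it, assuming Java 9 or later and that the script is launched through `cscript`. The class name `TimedConverter` and the script file name `SaveAsPdf.js` are placeholders made up for illustration, not names from the original setup.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class TimedConverter {

    /**
     * Runs a command and waits at most the given time for it to exit.
     * Returns true only if the process exited on its own with exit code 0.
     * If the deadline passes, the process is killed and false is returned.
     */
    static boolean runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Discard output so a full pipe buffer cannot block the child (Java 9+).
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();

        if (process.waitFor(timeout, unit)) {
            return process.exitValue() == 0;
        }

        // Deadline passed: kill the script host and wait until it is gone.
        process.destroyForcibly();
        process.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TimedConverter <docPath> <pdfPath>");
            return;
        }
        // "SaveAsPdf.js" is a hypothetical file name for the script above.
        List<String> command =
                List.of("cscript", "//NoLogo", "SaveAsPdf.js", args[0], args[1]);
        boolean ok = runWithTimeout(command, 5, TimeUnit.SECONDS);
        System.out.println(ok ? "Finished in time" : "Failed or timed out");
    }
}
```

Two caveats with this sketch. Killing the script host does not necessarily kill the Word instance it started through COM, so a stuck Word process may still need separate cleanup. Also, the script as posted catches its own errors and does not set a failure exit code, so a `true` result is probably best combined with a check that the PDF file actually exists.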

I've dealt with something similar in the past; the solution then included a 'kill Word processes' batch script that killed any Word processes still stuck after the program ended. Not very clean, but it did the job.
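The same kind of cleanup could also be done from Java instead of a batch script. The following is a rough sketch, assuming Java 9 or later (for `ProcessHandle`) and assuming the stuck processes show up under the executable name `WINWORD.EXE`; the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

final class StrayProcessCleaner {

    /** Extracts the file name from a command path, handling both / and \ separators. */
    private static String fileNameOf(String commandPath) {
        int cut = Math.max(commandPath.lastIndexOf('/'), commandPath.lastIndexOf('\\'));
        return commandPath.substring(cut + 1);
    }

    /** Returns handles of live processes whose executable file name matches, ignoring case. */
    static List<ProcessHandle> findByExecutableName(String exeName) {
        return ProcessHandle.allProcesses()
                .filter(ph -> ph.info().command()
                        .map(cmd -> fileNameOf(cmd).equalsIgnoreCase(exeName))
                        .orElse(false)) // command not visible: skip the process
                .collect(Collectors.toList());
    }

    /** Requests forcible termination of every matching process; returns how many requests succeeded. */
    static int killAll(String exeName) {
        int requested = 0;
        for (ProcessHandle ph : findByExecutableName(exeName)) {
            if (ph.destroyForcibly()) {
                requested++;
            }
        }
        return requested;
    }
}
```

Calling `StrayProcessCleaner.killAll("WINWORD.EXE")` after a batch would then play the role of the batch script. Like that script, it is a blunt tool: it also kills any Word instance a user happens to have open on the same machine.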

Any experiences or ideas would be appreciated!

Best Answer

You can use https://www.npmjs.com/package/@nativedocuments/docx-wasm in a serverless setup (e.g. AWS Lambda) to perform your conversions in parallel; Lambda takes care of the concurrency. docx-wasm is self-contained (i.e. there is no need to run Microsoft Word). It is offered on a freemium model.

Edit April 2019

https://github.com/NativeDocuments/docx-to-pdf-on-AWS-Lambda is a sample project for using it on Lambda.