Parse Microsoft Office files in Node.js


I'm working on a web application where users can upload Microsoft Office documents. Our server runs Node.js with Express.js and is hosted on Heroku, so I don't think I can install programs such as abiword or catdoc. I can handle the file uploads, but I can't parse the contents of the documents.

How can I read the contents of the doc file? The information will then be put into a database. It'd be nice to preserve basic formatting (bold, italic, underline), but not essential.


While there doesn't seem to be anything on NPM that handles Word directly, you might be able to use a REST API to do it through another cloud service. For example, <a href="http://saaspose.com/" rel="nofollow">Saaspose</a> (from the makers of the Aspose tools) has public APIs for <a href="http://saaspose.com/api/words" rel="nofollow">Word</a>, <a href="http://saaspose.com/api/cells" rel="nofollow">Excel</a>, <a href="http://saaspose.com/api/pdf" rel="nofollow">PDF</a>, and others. They list Node.js, JavaScript, and Heroku support on their page.


I see that Saaspose is now called <a href="http://www.aspose.com/cloud/total-api.aspx" rel="nofollow">Aspose for Cloud</a>.

Another API that claims to do something similar is <a href="http://www.doxument.com/" rel="nofollow">Doxument</a>.


The <a href="http://github.com/dkiyatkin/node-office" rel="nofollow">office</a> package (npm install office) seems to provide at least part of the answer. I use it to read Excel files, but so far I have not tried it on any Word documents.


There doesn't seem to be a Node.js module for this yet. The question linked below might help.

<a href="https://stackoverflow.com/questions/9038231/can-i-read-pdf-or-word-docs-with-node-js" rel="nofollow">Can I read PDF or Word Docs with Node.js?</a>


You can use mammoth to parse .docx files (<a href="https://www.npmjs.com/package/mammoth" rel="nofollow">https://www.npmjs.com/package/mammoth</a>) and xlsx to parse .xlsx files (<a href="https://github.com/SheetJS/js-xlsx" rel="nofollow">https://github.com/SheetJS/js-xlsx</a>).
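
As a rough sketch of how those two packages could be wired together for an upload handler, assuming the uploaded file is already available as a Buffer: the names parserFor and parseUpload are made up for this example, and both packages are loaded lazily so nothing is required until a matching file arrives. As far as I know, mammoth's HTML output keeps bold and italic by default but ignores underline.

```javascript
const path = require("path");

// Hypothetical helper: map a file's extension to the package that handles it.
// Returns null for formats these two packages are not used for here (e.g. .doc).
function parserFor(filename) {
  const ext = path.extname(filename).toLowerCase();
  if (ext === ".docx") return "mammoth";
  if (ext === ".xlsx") return "xlsx";
  return null;
}

// Hypothetical helper: turn an uploaded file (as a Buffer) into data that
// can be stored in a database.
async function parseUpload(filename, buffer) {
  switch (parserFor(filename)) {
    case "mammoth": {
      const mammoth = require("mammoth"); // npm install mammoth
      // convertToHtml resolves to { value: htmlString, messages: [...] }
      const result = await mammoth.convertToHtml({ buffer });
      return { kind: "html", value: result.value, warnings: result.messages };
    }
    case "xlsx": {
      const XLSX = require("xlsx"); // npm install xlsx
      const workbook = XLSX.read(buffer, { type: "buffer" });
      const sheets = {};
      for (const name of workbook.SheetNames) {
        // One array of row objects per sheet, keyed by the header row.
        sheets[name] = XLSX.utils.sheet_to_json(workbook.Sheets[name]);
      }
      return { kind: "sheets", value: sheets };
    }
    default:
      throw new Error("Unsupported file type: " + filename);
  }
}
```

In an Express route this might be called as parseUpload(originalName, fileBuffer), with the returned HTML or row objects written to the database. Both packages are pure JavaScript, so they should install on Heroku without extra system programs.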

