How to extract text from Word files using C#?

I am trying to convert a large number (100,000) of Word DOC files. They are quite old, from around the 1995 to 2000 versions of Word, I suppose. I keep going around in circles with what I find here on Stack Overflow and in the MS documentation.

What I want to do is simply read the file, stick the text into a string, parse the string, and take out the structured parts (the file is actually a structured report, with lines like Patient: Jon Doe). From that point on, I know what I am doing: I can parse the string data, put it into useful variables, and then store it in a database. But I do not know how to actually get the text into a string. Any help?

PPS: I found this reference, which supposedly converts a DOC file into a text file. It's a start, but I'd rather avoid doing a bunch of file manipulations.

Answer1:

If you use the Word object model, you must always instantiate a specific version of Word on the client (since running Word on a server is not recommended). Unfortunately, you then depend on Word's restrictions concerning older files; e.g., in Word 2010 you can open files from Office 95 only in sandbox mode (i.e., you're not able to access the file content programmatically). Additionally, you'll have to deal with unknown template content (documents with macros attached, for example).

In your case I'd rather look for a third-party component that allows you to access the content. I know that document management systems like OpenText eDocs and Autonomy iManage use other tools to full-text index documents of all types and can present the content in a viewer application. So if you look in this direction, maybe you'll find something useful.

Answer2:

A Word file is just an ordinary file as far as your code is concerned.

Try this:

    using System.IO;

    StreamReader streamReader = new StreamReader(filePath);
    string text = streamReader.ReadToEnd();
    streamReader.Close();
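Keep in mind that a binary DOC file contains more than the visible text, so reading it this way also returns formatting and other non-text data. A rough sketch of one way to filter that out, assuming the report text is stored as plain single-byte ASCII characters (that is an assumption; some Word versions may store text as UTF-16 instead, which this would not handle), is to keep only runs of printable bytes. The names `DocTextSketch`, `ExtractPrintableRuns`, `ExtractFromFile` and `minLength` are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DocTextSketch
{
    // Returns every run of at least minLength printable ASCII characters
    // (0x20-0x7E, plus tab) found in data. Any other byte ends the current run,
    // so a carriage return (0x0D) acts as a line separator.
    public static List<string> ExtractPrintableRuns(byte[] data, int minLength = 4)
    {
        var runs = new List<string>();
        var current = new StringBuilder();

        foreach (byte b in data)
        {
            bool printable = (b >= 0x20 && b <= 0x7E) || b == 0x09;
            if (printable)
            {
                current.Append((char)b);
            }
            else
            {
                // Keep the run only if it is long enough to be plausible text.
                if (current.Length >= minLength)
                {
                    runs.Add(current.ToString());
                }
                current.Clear();
            }
        }

        // Flush a run that reaches the end of the data.
        if (current.Length >= minLength)
        {
            runs.Add(current.ToString());
        }
        return runs;
    }

    // Convenience wrapper: reads the whole file as bytes and filters it.
    public static List<string> ExtractFromFile(string filePath, int minLength = 4)
    {
        return ExtractPrintableRuns(File.ReadAllBytes(filePath), minLength);
    }
}
```

Calling `DocTextSketch.ExtractFromFile(filePath)` gives a list of candidate text lines that you can then search for fields such as Patient:. Expect some junk (font names, metadata strings) among the results, so treat this as a starting point rather than a reliable converter.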
