
How to split pages of a Word document into separate files in C# [closed]

Question:

I have an OCR program that converts images to Word documents. The resulting Word document contains the text of all the images, one image per page, and I want to split it into separate files.

Is there any way to do this in C#?

Thanks.

Answer1:

This is the same approach as the <a href="https://stackoverflow.com/a/11772470/111794" rel="nofollow">other answer</a>, but wrapped in an iterator (IEnumerable<Range>) that is exposed as an extension method on Document.

    using System.Collections.Generic;
    using Microsoft.Office.Interop.Word;

    static class PagesExtension
    {
        //Yields one Range per page of the document, in page order
        public static IEnumerable<Range> Pages(this Document doc)
        {
            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
            int pageStart = 0;
            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++)
            {
                var page = doc.Range(pageStart);
                if (currentPageIndex < pageCount)
                {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                }
                else
                {
                    //Last page: there is no next page, so it runs to the end of the document
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;
                yield return page;
            }
        }
    }

The main code ends up like this:

    static void Main(string[] args)
    {
        var app = new Application();
        app.Visible = true;
        var doc = app.Documents.Open(@"path\to\source\document");
        foreach (var page in doc.Pages())
        {
            page.Copy();
            var doc2 = app.Documents.Add();
            doc2.Range().Paste();
        }
    }
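The loop above leaves each page in a new, unsaved document. Since the question asks for separate files, here is a minimal sketch of one way to save and close each page document. It relies on the Pages() extension method from this answer; the output folder and the "pageN.docx" naming scheme are hypothetical, and SaveAs2 assumes Word 2010 (object model 14.0) or later.

```csharp
using Microsoft.Office.Interop.Word;

static class SplitToFiles
{
    static void Main(string[] args)
    {
        var app = new Application();
        var doc = app.Documents.Open(@"path\to\source\document");

        // Hypothetical output location; the folder must already exist.
        string outputFolder = @"path\to\output\folder";
        int pageNumber = 1;

        try
        {
            // Pages() is the extension method defined earlier in this answer.
            foreach (var page in doc.Pages())
            {
                // Copy the page via the clipboard into a fresh document.
                page.Copy();
                var pageDoc = app.Documents.Add();
                pageDoc.Range().Paste();

                // Hypothetical naming scheme: page1.docx, page2.docx, ...
                string target = System.IO.Path.Combine(
                    outputFolder, "page" + pageNumber + ".docx");
                pageDoc.SaveAs2(FileName: target);
                pageDoc.Close();
                pageNumber++;
            }
        }
        finally
        {
            // Close the source without saving and shut down the Word instance.
            doc.Close(SaveChanges: WdSaveOptions.wdDoNotSaveChanges);
            app.Quit();
        }
    }
}
```

Because the copy goes through the clipboard, anything the user had on the clipboard is overwritten while this runs.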

Answer2:

You can manipulate the Word document from C# using the Word object model, if you have Word installed.

First, add a reference to the Word object model. Right-click on the project, then Add Reference... -> COM -> Microsoft Word 14.0 Object Model (or something similar, depending on your version of Word).

Then, you can use the following code:

    using Microsoft.Office.Interop.Word;
    //for older versions of Word use:
    //using Word;

    namespace WordSplitter
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a new instance of Word
                var app = new Application();

                //Show the Word instance.
                //If the code runs too slowly, you can show the application at the end of the program
                //Make sure it works properly first; otherwise, you'll get an error in a hidden window
                //(If it still runs too slowly, there are a few other ways to reduce screen updating)
                app.Visible = true;

                //We need a reference to the source document
                //It should be possible to get a reference to an open Word document, but I haven't tried it
                var doc = app.Documents.Open(@"path\to\file.doc");
                //(Can also use .docx)

                int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

                //We'll hold the start position of each page here
                int pageStart = 0;

                for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++)
                {
                    //This Range object will contain each page.
                    var page = doc.Range(pageStart);

                    //Generally, the end of the current page is 1 character before the start of the next.
                    //However, we need to handle the last page -- since there is no next page, the
                    //GoTo method will move to the *start* of the last page.
                    if (currentPageIndex < pageCount)
                    {
                        //page.GoTo returns a new Range object, leaving the page object unaffected
                        page.End = page.GoTo(
                            What: WdGoToItem.wdGoToPage,
                            Which: WdGoToDirection.wdGoToAbsolute,
                            Count: currentPageIndex + 1
                        ).Start - 1;
                    }
                    else
                    {
                        page.End = doc.Range().End;
                    }
                    pageStart = page.End + 1;

                    //Copy and paste the contents of the Range into a new document
                    page.Copy();
                    var doc2 = app.Documents.Add();
                    doc2.Range().Paste();
                }
            }
        }
    }
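The comments in the code mention two things it does not show: getting a reference to a Word document that is already open, and reducing screen updating. The following is a rough sketch of both, not something taken from the answer. It assumes the program runs on .NET Framework (where Marshal.GetActiveObject is available), that Word is already running, and that it has at least one document open.

```csharp
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

static class AttachToWord
{
    static void Main()
    {
        // Attach to the running Word instance.
        // This throws a COMException if Word is not running.
        var app = (Application)Marshal.GetActiveObject("Word.Application");

        // The document that currently has focus in that instance.
        // This throws if no document is open.
        Document doc = app.ActiveDocument;

        // Suspend screen redraws while the document is being processed.
        app.ScreenUpdating = false;
        try
        {
            int pageCount = (int)doc.Range().Information[
                WdInformation.wdNumberOfPagesInDocument];
            System.Console.WriteLine(doc.Name + ": " + pageCount + " page(s)");
        }
        finally
        {
            // Always restore redrawing, even if something failed.
            app.ScreenUpdating = true;
        }
    }
}
```

The page-splitting loop from the answer could replace the body of the try block.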

Reference: <a href="http://msdn.microsoft.com/en-us/library/kw65a0we" rel="nofollow">Word Object Model Overview on MSDN</a>

Answer3:

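This is not easy to do from the Word document side, although documents saved by Word do contain w:lastRenderedPageBreak elements that record where page breaks fell when the document was last rendered.

It is best to have your OCR program insert some marker into the document between each block of converted text.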

Then, depending on what sort of Word document it is, process the file with an appropriate tool.
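As an illustration of the marker idea, here is a minimal sketch that splits text on a marker string. It assumes the document text has already been extracted into a plain string; the marker "[[PAGE]]" and the names MarkerSplitter and SplitOnMarker are hypothetical, not anything the OCR program or Word provides.

```csharp
using System;
using System.Collections.Generic;

static class MarkerSplitter
{
    // Hypothetical marker that the OCR program would insert between blocks.
    public const string PageMarker = "[[PAGE]]";

    // Splits the text on the marker and returns the non-empty blocks,
    // with surrounding whitespace trimmed from each block.
    public static List<string> SplitOnMarker(string text, string marker)
    {
        var blocks = new List<string>();
        foreach (var piece in text.Split(new[] { marker }, StringSplitOptions.None))
        {
            var trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                blocks.Add(trimmed);
            }
        }
        return blocks;
    }
}
```

Each returned block could then be written to its own file, for example with System.IO.File.WriteAllText.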
