How to split pages of a Word document into separate files in C# [closed]


I have an OCR program that converts images to Word documents. The Word document contains the text of all the images, and I want to split it into separate files.

Is there any way to do this in C#?



Same as <a href="https://stackoverflow.com/a/11772470/111794" rel="nofollow">other answer</a>, but using an iterator that returns IEnumerable<Range>, exposed as an extension method on the document.

    using System.Collections.Generic;
    using Microsoft.Office.Interop.Word;

    static class PagesExtension
    {
        //Lazily yields one Range per page of the document
        public static IEnumerable<Range> Pages(this Document doc)
        {
            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
            int pageStart = 0;
            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++)
            {
                var page = doc.Range(pageStart);
                if (currentPageIndex < pageCount)
                {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                }
                else
                {
                    //There is no next page, so the last page runs to the end of the document
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;
                yield return page;
            }
        }
    }

The main code ends up like this:

    static void Main(string[] args)
    {
        var app = new Application();
        app.Visible = true;
        var doc = app.Documents.Open(@"path\to\source\document");
        foreach (var page in doc.Pages())
        {
            page.Copy();
            var doc2 = app.Documents.Add();
            doc2.Range().Paste();
        }
    }
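The loop above leaves each page in a new, unsaved document. Since the question asks for separate files, here is a minimal sketch of how each page could be written to disk. `PageSaver`, `SaveAsFile`, the output folder, and the `pageN.docx` naming scheme are all hypothetical names chosen for illustration. The sketch assumes Word 2010 or later, where `Document.SaveAs2` is available, and it has not been run against every Word version.

```csharp
using System.IO;
using Microsoft.Office.Interop.Word;

static class PageSaver
{
    // Hypothetical helper: copies one page Range into a new document,
    // saves that document under outputFolder, then closes it.
    public static void SaveAsFile(Application app, Range page, string outputFolder, int pageNumber)
    {
        page.Copy();
        var pageDoc = app.Documents.Add();
        pageDoc.Range().Paste();

        // Assumed naming scheme: page1.docx, page2.docx, ...
        string path = Path.Combine(outputFolder, "page" + pageNumber + ".docx");
        pageDoc.SaveAs2(FileName: path);

        // The document was just saved, so close it without prompting.
        pageDoc.Close(SaveChanges: false);
    }
}

// Possible usage inside Main, with the Pages() extension method:
//
//     int pageNumber = 1;
//     foreach (var page in doc.Pages())
//         PageSaver.SaveAsFile(app, page, @"path\to\output\folder", pageNumber++);
```

Closing each new document as soon as it is saved keeps the number of open windows down when the source document has many pages.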


You can manipulate the Word document from C# using the Word object model, if you have Word installed.

First, add a reference to the Word object model. Right-click on the project, then Add Reference... -> COM -> Microsoft Word 14.0 Object Library (or something similar, depending on your version of Word).

Then, you can use the following code:

    using Microsoft.Office.Interop.Word;
    //for older versions of Word use:
    //using Word;

    namespace WordSplitter
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a new instance of Word
                var app = new Application();

                //Show the Word instance.
                //If the code runs too slowly, you can show the application at the end of the program
                //Make sure it works properly first; otherwise, you'll get an error in a hidden window
                //(If it still runs too slowly, there are a few other ways to reduce screen updating)
                app.Visible = true;

                //We need a reference to the source document
                //It should be possible to get a reference to an open Word document, but I haven't tried it
                var doc = app.Documents.Open(@"path\to\file.doc");
                //(Can also use .docx)

                int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

                //We'll hold the start position of each page here
                int pageStart = 0;

                for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++)
                {
                    //This Range object will contain each page.
                    var page = doc.Range(pageStart);

                    //Generally, the end of the current page is 1 character before the start of the next.
                    //However, we need to handle the last page -- since there is no next page, the
                    //GoTo method will move to the *start* of the last page.
                    if (currentPageIndex < pageCount)
                    {
                        //page.GoTo returns a new Range object, leaving the page object unaffected
                        page.End = page.GoTo(
                            What: WdGoToItem.wdGoToPage,
                            Which: WdGoToDirection.wdGoToAbsolute,
                            Count: currentPageIndex + 1
                        ).Start - 1;
                    }
                    else
                    {
                        page.End = doc.Range().End;
                    }
                    pageStart = page.End + 1;

                    //Copy and paste the contents of the Range into a new document
                    page.Copy();
                    var doc2 = app.Documents.Add();
                    doc2.Range().Paste();
                }
            }
        }
    }

Reference: <a href="http://msdn.microsoft.com/en-us/library/kw65a0we" rel="nofollow">Word Object Model Overview on MSDN</a>


Not easily from the Word document alone, though Word does write w:lastRenderedPageBreak elements into the documents it saves.

Best to have your OCR program insert some marker into the document between each block of converted text.

Then, depending on what sort of Word document it is, process the file with an appropriate tool.
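As a rough illustration of the "appropriate tool" step for a .docx file, the sketch below counts the w:lastRenderedPageBreak markers in the main document part, using only the .NET base libraries. It assumes the file is a .docx that was last saved by Word itself (a file written directly by an OCR program may contain no such markers) and that the main part lives at the conventional word/document.xml path. `RenderedPageBreaks`, `Count`, and `ReadDocumentXml` are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

static class RenderedPageBreaks
{
    // WordprocessingML main namespace used inside .docx files.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts w:lastRenderedPageBreak elements in the given document XML.
    public static int Count(string documentXml)
    {
        return XDocument.Parse(documentXml)
                        .Descendants(W + "lastRenderedPageBreak")
                        .Count();
    }

    // Reads word/document.xml out of a .docx package, which is a ZIP archive.
    // Assumption: the main document part is stored at the conventional path.
    public static string ReadDocumentXml(string docxPath)
    {
        using (var zip = ZipFile.OpenRead(docxPath))
        {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");
            using (var reader = new StreamReader(entry.Open()))
                return reader.ReadToEnd();
        }
    }
}
```

Calling `RenderedPageBreaks.Count(RenderedPageBreaks.ReadDocumentXml(path))` gives a quick check of whether the markers are present at all. These markers only record where Word last laid out the pages, so they are hints, not hard boundaries, and an explicit marker inserted by the OCR program is the more reliable split point. On .NET Framework, `ZipFile` needs a reference to System.IO.Compression.FileSystem.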


