Reading legacy Word forms checkboxes converted to PDF


Our customers send us orders as PDF forms, which are generated from a Word document built with legacy form fields.

Currently, people at our customer center key the orders into our system manually, but we have decided to try to automate this task.

I'm able to read the content of the PDF with a simple PdfReader per page:

    public static string GetPdfText(string path)
    {
        var text = string.Empty;
        using (var reader = new PdfReader(path))
        {
            for (var page = 1; page <= reader.NumberOfPages; page++)
            {
                text += PdfTextExtractor.GetTextFromPage(reader, page);
            }
        }
        return text;
    }

But not the checkboxes...

I am able to detect the checkboxes as dictionaries while running through every object in the PDF, but I'm unable to distinguish them from other objects or read the value...

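    public static IEnumerable<PdfDictionary> ReadCheckboxes(string path)
    {
        using (var reader = new PdfReader(path))
        {
            var checkboxes = new List<PdfDictionary>();
            for (var i = 0; i < reader.XrefSize; i++)
            {
                // GetPdfObject can return null or a non-dictionary object (array, number, ...).
                // A direct cast would throw for those, so keep only the dictionaries.
                var dictionary = reader.GetPdfObject(i) as PdfDictionary;
                if (dictionary != null)
                {
                    checkboxes.Add(dictionary);
                }
            }
            return checkboxes;
        }
    }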

What am I missing? I've also tried reading the AcroFields, but they're empty...

I have uploaded a sample PDF with legacy checkboxes <a href="https://www.dropbox.com/s/4z7ky3yy2yaj53i/Doc1.pdf?dl=0" rel="nofollow">here</a>.

Currently there is no option to integrate our systems or to make any changes to the underlying PDF or Word document.


<em>The OP indicated in comments that a solution which returns an output like "checkbox at position <strong>x</strong><sub>0</sub>, <strong>y</strong><sub>0</sub>, checked; checkbox at position <strong>x</strong><sub>1</sub>, <strong>y</strong><sub>1</sub>, not checked; ..." would suffice, i.e. the "forms" are static enough that these positions allow identification of the meaning of the respective checkboxes. Thus, here is an implementation of this variant.</em>

<em>I just saw that the question is tagged <a href="/questions/tagged/c%23" class="post-tag" title="show questions tagged 'c#'" rel="nofollow">c#</a> while I have implemented the search using Java. This should not be too big a problem, as the code should be easy to port. If there are problems porting, I'll add a C# version here.</em>

As the checkboxes are drawn using vector graphics, the text extraction already used by the OP does not find them. Fortunately, though, the iText parsing framework can also be used to look for vector graphics.

Thus, we first need an ExtRenderListener (IExtRenderListener in iTextSharp) which collects the boxes. It only has non-trivial implementations of the interface methods modifyPath and renderPath:

    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo) {
        switch (renderInfo.getOperation()) {
        case PathConstructionRenderInfo.RECT: {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            float w = renderInfo.getSegmentData().get(2);
            float h = renderInfo.getSegmentData().get(3);
            rectangle = new Rectangle(x, y, x + w, y + h);
            // no break: falls through to MOVETO, which records (x, y) and clears lineToVector
        }
        case PathConstructionRenderInfo.MOVETO: {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            moveToVector = new Vector(x, y, 1);
            lineToVector = null;
            break;
        }
        case PathConstructionRenderInfo.LINETO: {
            if (moveToVector != null) {
                float x = renderInfo.getSegmentData().get(0);
                float y = renderInfo.getSegmentData().get(1);
                lineToVector = new Vector(x, y, 1);
            }
            break;
        }
        default:
            moveToVector = null;
            lineToVector = null;
        }
    }

    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo) {
        if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP) {
            if (rectangle != null) {
                // a painted rectangle is a checkbox candidate: store both of its diagonals
                Vector a = new Vector(rectangle.getLeft(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
                Vector b = new Vector(rectangle.getRight(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
                Vector c = new Vector(rectangle.getRight(), rectangle.getTop(), 1).cross(renderInfo.getCtm());
                Vector d = new Vector(rectangle.getLeft(), rectangle.getTop(), 1).cross(renderInfo.getCtm());

                Box box = new Box(new LineSegment(a, c), new LineSegment(b, d));
                boxes.add(box);
            }
            if (moveToVector != null && lineToVector != null) {
                // a painted line may be one stroke of the cross inside the most recent box
                if (!boxes.isEmpty()) {
                    Vector from = moveToVector.cross(renderInfo.getCtm());
                    Vector to = lineToVector.cross(renderInfo.getCtm());
                    boxes.get(boxes.size() - 1).selectDiagonal(new LineSegment(from, to));
                }
            }
        }

        moveToVector = null;
        lineToVector = null;
        rectangle = null;
        return null;
    }

    Vector moveToVector = null;
    Vector lineToVector = null;
    Rectangle rectangle = null;

    public Iterable<Box> getBoxes() {
        return boxes;
    }

    final List<Box> boxes = new ArrayList<Box>();

<em>(from <a href="https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/extract/CheckBoxExtractionStrategy.java#L102" rel="nofollow">CheckBoxExtractionStrategy.java</a>)</em>
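The listener multiplies every corner and line end point by the current transformation matrix (CTM) because path coordinates in a content stream are relative to whatever coordinate system is active at that point, not necessarily to the page. As a rough illustration of that step, here is a minimal plain-Java sketch without iText. The class name `CtmSketch` and the matrix values are hypothetical; the formula is the usual PDF mapping for a matrix `[a b c d e f]`.

```java
// Minimal sketch (plain Java, no iText): map a point from path coordinates
// to page coordinates using a PDF transformation matrix [a b c d e f].
final class CtmSketch {

    /** Returns {x', y'} with x' = a*x + c*y + e and y' = b*x + d*y + f. */
    static float[] transform(float[] ctm, float x, float y) {
        float a = ctm[0], b = ctm[1], c = ctm[2], d = ctm[3], e = ctm[4], f = ctm[5];
        return new float[] { a * x + c * y + e, b * x + d * y + f };
    }

    public static void main(String[] args) {
        // Hypothetical CTM: scale by 0.5, then translate by (10, 20).
        float[] ctm = { 0.5f, 0f, 0f, 0.5f, 10f, 20f };
        float[] p = transform(ctm, 100f, 200f);
        System.out.println(p[0] + ", " + p[1]); // prints 60.0, 120.0
    }
}
```

With an identity CTM the coordinates pass through unchanged, so the transformation only matters when the producer scales or shifts the coordinate system before drawing the boxes.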

It uses a helper class Box which models the checkboxes using their respective diagonals:

    public class Box {
        public LineSegment getDiagonal() {
            return diagonalA;
        }

        public boolean isChecked() {
            return selectedA && selectedB;
        }

        Box(LineSegment diagonalA, LineSegment diagonalB) {
            this.diagonalA = diagonalA;
            this.diagonalB = diagonalB;
        }

        void selectDiagonal(LineSegment diagonal) {
            if (approximatelyEquals(diagonal, diagonalA))
                selectedA = true;
            else if (approximatelyEquals(diagonal, diagonalB))
                selectedB = true;
        }

        boolean approximatelyEquals(LineSegment a, LineSegment b) {
            float permissiveness = a.getLength() / 10.0f;
            if (approximatelyEquals(a.getStartPoint(), b.getStartPoint(), permissiveness)
                    && approximatelyEquals(a.getEndPoint(), b.getEndPoint(), permissiveness))
                return true;
            if (approximatelyEquals(a.getStartPoint(), b.getEndPoint(), permissiveness)
                    && approximatelyEquals(a.getEndPoint(), b.getStartPoint(), permissiveness))
                return true;
            return false;
        }

        boolean approximatelyEquals(Vector a, Vector b, float permissiveness) {
            return a.subtract(b).length() < permissiveness;
        }

        boolean selectedA = false;
        boolean selectedB = false;
        final LineSegment diagonalA, diagonalB;
    }

<em>(Inner class in <a href="https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/extract/CheckBoxExtractionStrategy.java#L34" rel="nofollow">CheckBoxExtractionStrategy.java</a>)</em>
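To make the matching rule in Box easier to follow, the sketch below restates it in plain Java without iText types. A drawn line counts as a diagonal if both of its end points lie within a tenth of its own length of the diagonal's end points, in either direction. The class name `DiagonalMatchSketch`, the array layout `{x1, y1, x2, y2}` and the sample coordinates are assumptions made for illustration only.

```java
// Minimal sketch (plain Java, no iText) of the tolerance-based diagonal match.
// A segment is stored as {x1, y1, x2, y2}.
final class DiagonalMatchSketch {

    static float length(float[] s) {
        return (float) Math.hypot(s[2] - s[0], s[3] - s[1]);
    }

    static boolean near(float ax, float ay, float bx, float by, float tolerance) {
        return Math.hypot(ax - bx, ay - by) < tolerance;
    }

    /** True if the stroke matches the diagonal in either direction. */
    static boolean approximatelyEquals(float[] stroke, float[] diagonal) {
        float tolerance = length(stroke) / 10.0f; // same 10% rule as in Box
        boolean sameDirection = near(stroke[0], stroke[1], diagonal[0], diagonal[1], tolerance)
                && near(stroke[2], stroke[3], diagonal[2], diagonal[3], tolerance);
        boolean reversed = near(stroke[0], stroke[1], diagonal[2], diagonal[3], tolerance)
                && near(stroke[2], stroke[3], diagonal[0], diagonal[1], tolerance);
        return sameDirection || reversed;
    }

    public static void main(String[] args) {
        // Hypothetical 10 x 10 box with its lower-left corner at the origin.
        float[] diagonalA = { 0f, 0f, 10f, 10f };
        float[] diagonalB = { 10f, 0f, 0f, 10f };

        float[] slightlyInset = { 0.5f, 0.5f, 9.5f, 9.5f }; // stroke drawn just inside the box
        float[] drawnBackwards = { 0f, 10f, 10f, 0f };      // diagonal B, opposite direction
        float[] bottomEdge = { 0f, 0f, 10f, 0f };           // an edge, not a diagonal

        System.out.println(approximatelyEquals(slightlyInset, diagonalA));  // true
        System.out.println(approximatelyEquals(drawnBackwards, diagonalB)); // true
        System.out.println(approximatelyEquals(bottomEdge, diagonalA));     // false
    }
}
```

The tolerance matters because the cross strokes are not guaranteed to start and end exactly on the box corners. A box is only reported as checked once both diagonals have been matched.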

Applying it like this to the sample document:

    for (int page = 1; page <= pdfReader.getNumberOfPages(); page++) {
        System.out.printf("\nPage %s\n====\n", page);

        CheckBoxExtractionStrategy strategy = new CheckBoxExtractionStrategy();
        PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
        parser.processContent(page, strategy);

        for (Box box : strategy.getBoxes()) {
            Vector basePoint = box.getDiagonal().getStartPoint();
            System.out.printf("at %s, %s - %s\n", basePoint.get(Vector.I1), basePoint.get(Vector.I2),
                    box.isChecked() ? "checked" : "unchecked");
        }
    }

one gets the output

<blockquote>
Page 1
====
at 73.104, 757.8 - checked
at 86.544, 757.8 - checked
at 99.984, 757.8 - unchecked
</blockquote>

for the OP's document

<a href="https://i.stack.imgur.com/Y88UL.png" rel="nofollow"><img alt="Screenshot of Doc1.pdf" src="https://i.stack.imgur.com/Y88UL.png" /></a>
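Since the forms are static, the reported positions can then be mapped back to the meaning of each checkbox. The following is only a sketch of one way to do that in plain Java. The class name `CheckboxLabeler`, the field names `option1` to `option3` and the tolerance of one unit are hypothetical; the coordinates are the ones printed above for the sample document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: look up a form field name by the position of its checkbox.
final class CheckboxLabeler {

    // Hypothetical tolerance (in PDF units) to absorb small coordinate differences.
    static final float TOLERANCE = 1.0f;

    // Known checkbox positions of the static form; the field names are made up.
    static final Map<String, float[]> KNOWN_POSITIONS = new LinkedHashMap<>();
    static {
        KNOWN_POSITIONS.put("option1", new float[] { 73.104f, 757.8f });
        KNOWN_POSITIONS.put("option2", new float[] { 86.544f, 757.8f });
        KNOWN_POSITIONS.put("option3", new float[] { 99.984f, 757.8f });
    }

    /** Returns the field name for a position, or null if no known checkbox is close enough. */
    static String labelFor(float x, float y) {
        for (Map.Entry<String, float[]> entry : KNOWN_POSITIONS.entrySet()) {
            float[] position = entry.getValue();
            if (Math.abs(position[0] - x) <= TOLERANCE && Math.abs(position[1] - y) <= TOLERANCE) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(labelFor(86.5f, 757.8f)); // option2
        System.out.println(labelFor(300f, 100f));    // null
    }
}
```

In practice the table of positions would have to be built once per form layout, for example by running the extraction on a blank form, and kept per page if the form spans several pages.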

