Reading legacy Word forms checkboxes converted to PDF


Our customers send us orders as PDF forms, which are generated from a Word document built with legacy form fields.

Currently, people at our customer center punch the orders into our system by hand, but we have decided to try to automate this task.

I'm able to read the content of the PDF with a simple PdfReader per page:

    public static string GetPdfText(string path)
    {
        var text = string.Empty;
        using (var reader = new PdfReader(path))
        {
            for (var page = 1; page <= reader.NumberOfPages; page++)
            {
                text += PdfTextExtractor.GetTextFromPage(reader, page);
            }
        }
        return text;
    }

But not the checkboxes...

I am able to detect the checkboxes as dictionaries while running through every object in the PDF, but I'm unable to distinguish them from other objects or read the value...

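    public static IEnumerable<PdfDictionary> ReadCheckboxes(string path)
    {
        using (var reader = new PdfReader(path))
        {
            var checkboxes = new List<PdfDictionary>();
            for (var i = 0; i < reader.XrefSize; i++)
            {
                // Not every object is a dictionary (and some entries are null),
                // so use a safe cast instead of a hard cast that would throw.
                var dictionary = reader.GetPdfObject(i) as PdfDictionary;
                if (dictionary != null)
                {
                    checkboxes.Add(dictionary);
                }
            }
            return checkboxes;
        }
    }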

What am I missing? I've also tried reading the AcroFields, but they're empty...

I have uploaded a sample PDF with legacy checkboxes <a href="https://www.dropbox.com/s/4z7ky3yy2yaj53i/Doc1.pdf?dl=0" rel="nofollow">here</a>.

Currently there is no option to integrate our systems directly or to make any changes to the underlying PDF or Word document.


<em>The OP indicated in comments that a solution which returns an output like "checkbox at position <strong>x</strong><sub>0</sub>, <strong>y</strong><sub>0</sub>, checked; checkbox at position <strong>x</strong><sub>1</sub>, <strong>y</strong><sub>1</sub>, not checked; ..." would suffice, i.e. the OP's "forms" are static enough that these positions identify the meaning of the respective checkboxes. Thus, here is an implementation of this variant.</em>

<em>I just saw that the question is tagged <a href="/questions/tagged/c%23" class="post-tag" title="show questions tagged 'c#'" rel="nofollow">c#</a> while I have implemented the search using Java. This should not be too big a problem; the code should be easy to port. If there are problems porting, I'll add a C# version here.</em>

As the checkboxes are drawn using vector graphics, the text extraction already used by the OP does not find them. Fortunately, though, the iText parsing framework can also be used to look for vector graphics.

Thus, we first need an ExtRenderListener (IExtRenderListener in iTextSharp) which collects the boxes. It only has non-trivial implementations of the interface methods modifyPath and renderPath:

    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo) {
        switch (renderInfo.getOperation()) {
        case PathConstructionRenderInfo.RECT: {
            // A rectangle is a checkbox candidate.
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            float w = renderInfo.getSegmentData().get(2);
            float h = renderInfo.getSegmentData().get(3);
            rectangle = new Rectangle(x, y, x+w, y+h);
            break; // without this, RECT would fall through into MOVETO
        }
        case PathConstructionRenderInfo.MOVETO: {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            moveToVector = new Vector(x, y, 1);
            lineToVector = null;
            break;
        }
        case PathConstructionRenderInfo.LINETO: {
            if (moveToVector != null) {
                float x = renderInfo.getSegmentData().get(0);
                float y = renderInfo.getSegmentData().get(1);
                lineToVector = new Vector(x, y, 1);
            }
            break;
        }
        default:
            moveToVector = null;
            lineToVector = null;
        }
    }

    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo) {
        if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP) {
            if (rectangle != null) {
                // Transform the corners by the CTM and store the box by its two diagonals.
                Vector a = new Vector(rectangle.getLeft(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
                Vector b = new Vector(rectangle.getRight(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
                Vector c = new Vector(rectangle.getRight(), rectangle.getTop(), 1).cross(renderInfo.getCtm());
                Vector d = new Vector(rectangle.getLeft(), rectangle.getTop(), 1).cross(renderInfo.getCtm());

                Box box = new Box(new LineSegment(a, c), new LineSegment(b, d));
                boxes.add(box);
            }
            if (moveToVector != null && lineToVector != null) {
                if (!boxes.isEmpty()) {
                    // A single line: check whether it is a diagonal of the most recent box.
                    Vector from = moveToVector.cross(renderInfo.getCtm());
                    Vector to = lineToVector.cross(renderInfo.getCtm());

                    boxes.get(boxes.size() - 1).selectDiagonal(new LineSegment(from, to));
                }
            }
        }

        moveToVector = null;
        lineToVector = null;
        rectangle = null;
        return null;
    }

    Vector moveToVector = null;
    Vector lineToVector = null;
    Rectangle rectangle = null;

    public Iterable<Box> getBoxes() {
        return boxes;
    }

    final List<Box> boxes = new ArrayList<Box>();

<em>(from <a href="https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/extract/CheckBoxExtractionStrategy.java#L102" rel="nofollow">CheckBoxExtractionStrategy.java</a>)</em>
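The calls to `cross(renderInfo.getCtm())` above convert each point from the coordinate system that is current while the path is drawn into page coordinates. As a rough, iText-free illustration of that step, the following sketch applies a PDF transformation matrix given as the six numbers `[a b c d e f]` to a point; the class name `CtmTransform` and the sample matrix are made up for this example.

```java
/** Minimal sketch: applies a PDF transformation matrix [a b c d e f] to a point. */
class CtmTransform {

    /** Returns {x', y'} for the row vector [x y 1] multiplied by the matrix. */
    static float[] apply(float[] ctm, float x, float y) {
        float a = ctm[0], b = ctm[1], c = ctm[2], d = ctm[3], e = ctm[4], f = ctm[5];
        return new float[] { a * x + c * y + e, b * x + d * y + f };
    }

    public static void main(String[] args) {
        // Hypothetical matrix: scale by 2, then translate by (10, 20).
        float[] scaleAndShift = { 2f, 0f, 0f, 2f, 10f, 20f };
        float[] p = apply(scaleAndShift, 5f, 7f);
        System.out.println(p[0] + ", " + p[1]); // prints 20.0, 34.0
    }
}
```

With the identity matrix `{1, 0, 0, 1, 0, 0}` the point is returned unchanged, which is why the transformation is easy to overlook in simple documents.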

It uses a helper class Box which models the checkboxes using their respective diagonals:

    public class Box {
        public LineSegment getDiagonal() {
            return diagonalA;
        }

        public boolean isChecked() {
            return selectedA && selectedB;
        }

        Box(LineSegment diagonalA, LineSegment diagonalB) {
            this.diagonalA = diagonalA;
            this.diagonalB = diagonalB;
        }

        void selectDiagonal(LineSegment diagonal) {
            if (approximatelyEquals(diagonal, diagonalA))
                selectedA = true;
            else if (approximatelyEquals(diagonal, diagonalB))
                selectedB = true;
        }

        boolean approximatelyEquals(LineSegment a, LineSegment b) {
            float permissiveness = a.getLength() / 10.0f;
            if (approximatelyEquals(a.getStartPoint(), b.getStartPoint(), permissiveness) &&
                    approximatelyEquals(a.getEndPoint(), b.getEndPoint(), permissiveness))
                return true;
            if (approximatelyEquals(a.getStartPoint(), b.getEndPoint(), permissiveness) &&
                    approximatelyEquals(a.getEndPoint(), b.getStartPoint(), permissiveness))
                return true;
            return false;
        }

        boolean approximatelyEquals(Vector a, Vector b, float permissiveness) {
            return a.subtract(b).length() < permissiveness;
        }

        boolean selectedA = false;
        boolean selectedB = false;
        final LineSegment diagonalA, diagonalB;
    }

<em>(Inner class in <a href="https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/extract/CheckBoxExtractionStrategy.java#L34" rel="nofollow">CheckBoxExtractionStrategy.java</a>)</em>
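To make the matching rule easier to experiment with outside of iText, here is a simplified, dependency-free sketch of the same idea: a box counts as checked once lines roughly coinciding with both of its diagonals have been seen, with a tolerance of a tenth of the line length. The class `DiagonalBox` and its method `offerLine` are hypothetical names for this illustration, not part of iText or of the code above.

```java
/** Simplified, iText-free model of the diagonal-matching idea. */
class DiagonalBox {
    // Each segment is stored as {x1, y1, x2, y2} in page coordinates.
    private final float[] diagonalA;
    private final float[] diagonalB;
    private boolean selectedA = false;
    private boolean selectedB = false;

    /** Builds the box from its lower-left corner, width and height. */
    DiagonalBox(float x, float y, float w, float h) {
        diagonalA = new float[] { x, y, x + w, y + h }; // lower left to upper right
        diagonalB = new float[] { x + w, y, x, y + h }; // lower right to upper left
    }

    /** Records a drawn line if it roughly coincides with one of the diagonals. */
    void offerLine(float x1, float y1, float x2, float y2) {
        float[] line = { x1, y1, x2, y2 };
        if (approximatelyEquals(line, diagonalA))
            selectedA = true;
        else if (approximatelyEquals(line, diagonalB))
            selectedB = true;
    }

    boolean isChecked() {
        return selectedA && selectedB;
    }

    private static boolean approximatelyEquals(float[] a, float[] b) {
        // Tolerance: a tenth of the length of segment a.
        float tolerance = (float) Math.hypot(a[2] - a[0], a[3] - a[1]) / 10.0f;
        boolean sameDirection = near(a[0], a[1], b[0], b[1], tolerance)
                && near(a[2], a[3], b[2], b[3], tolerance);
        boolean reversed = near(a[0], a[1], b[2], b[3], tolerance)
                && near(a[2], a[3], b[0], b[1], tolerance);
        return sameDirection || reversed;
    }

    private static boolean near(float x1, float y1, float x2, float y2, float tolerance) {
        return Math.hypot(x1 - x2, y1 - y2) < tolerance;
    }
}
```

Comparing both orientations of a segment matters because a cross can be drawn starting from either end of each stroke.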

Applying it like this to the sample document:

    for (int page = 1; page <= pdfReader.getNumberOfPages(); page++) {
        System.out.printf("\nPage %s\n====\n", page);

        CheckBoxExtractionStrategy strategy = new CheckBoxExtractionStrategy();
        PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
        parser.processContent(page, strategy);

        for (Box box : strategy.getBoxes()) {
            Vector basePoint = box.getDiagonal().getStartPoint();
            System.out.printf("at %s, %s - %s\n", basePoint.get(Vector.I1), basePoint.get(Vector.I2),
                    box.isChecked() ? "checked" : "unchecked");
        }
    }

one gets the output

<blockquote>
Page 1
====
at 73.104, 757.8 - checked
at 86.544, 757.8 - checked
at 99.984, 757.8 - unchecked
</blockquote>

for the OP's document

<a href="https://i.stack.imgur.com/Y88UL.png" rel="nofollow"><img alt="Screenshot of Doc1.pdf" src="https://i.stack.imgur.com/Y88UL.png" /></a>
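Since the forms are said to be static enough for positions to identify each checkbox, the reported coordinates could then be mapped to their meanings with a small lookup. The following is only a sketch under assumptions: the class `CheckboxFieldMapper`, the field names, and the tolerance are invented for illustration; only coordinates of the kind printed in the output come from the document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: maps checkbox positions to field names, allowing a small deviation. */
class CheckboxFieldMapper {
    private final Map<String, float[]> knownPositions = new LinkedHashMap<>();
    private final float tolerance;

    CheckboxFieldMapper(float tolerance) {
        this.tolerance = tolerance;
    }

    /** Registers the expected base point of a checkbox under a (hypothetical) field name. */
    void register(String fieldName, float x, float y) {
        knownPositions.put(fieldName, new float[] { x, y });
    }

    /** Returns the nearest registered field within the tolerance, or null if there is none. */
    String lookup(float x, float y) {
        String best = null;
        double bestDistance = tolerance;
        for (Map.Entry<String, float[]> entry : knownPositions.entrySet()) {
            float[] position = entry.getValue();
            double distance = Math.hypot(position[0] - x, position[1] - y);
            if (distance <= bestDistance) {
                best = entry.getKey();
                bestDistance = distance;
            }
        }
        return best;
    }
}
```

For example, after `register("optionA", 73.104f, 757.8f)` with a tolerance of `2f`, a box reported at `73.2, 757.9` resolves to `optionA`, while a box far from every registered position resolves to `null` and can be flagged for manual review.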


