
How to detect table start in iTextSharp?

Question:

I am trying to convert a PDF to a CSV file. The PDF has data in tabular format, with the first row as the header. I have got as far as extracting the text from a cell, comparing the baselines of the text in the table, and detecting a new line, but I need to compare the table borders to detect the start of the table. I do not know how to detect and compare lines in a PDF. Can anyone help me?

Thanks!!!

Answer1:

As you've seen (hopefully), PDFs have no concept of tables, just text placed at specific locations and lines drawn around them. There is no internal relationship between the text and the lines. This is very important to understand.

Knowing this, if all of the cells have enough padding, you can look for gaps between characters that are large enough, such as the width of three or more spaces. If the cells don't have enough spacing, this will unfortunately probably break.
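As a rough sketch of that gap heuristic, assume each text chunk on a line has already been reduced to its text and its left/right x-coordinates. The `Chunk` type, the `SplitIntoColumns` method, and the `spaceWidth` and `minSpaces` parameters are hypothetical names for illustration, not part of iTextSharp. A new column starts wherever the gap to the previous chunk is at least three space widths:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one text chunk on a line (not an iTextSharp type).
public class Chunk
{
    public string Text;
    public float Left;
    public float Right;

    public Chunk(string text, float left, float right)
    {
        Text = text;
        Left = left;
        Right = right;
    }
}

public static class GapSplitter
{
    // Splits the chunks of one text line into columns wherever the horizontal
    // gap between neighbouring chunks is at least minSpaces * spaceWidth.
    public static List<string> SplitIntoColumns(IEnumerable<Chunk> line, float spaceWidth, int minSpaces = 3)
    {
        var columns = new List<string>();
        var ordered = line.OrderBy(c => c.Left).ToList();
        if (ordered.Count == 0)
        {
            return columns;
        }

        var current = ordered[0].Text;
        for (var i = 1; i < ordered.Count; i++)
        {
            var gap = ordered[i].Left - ordered[i - 1].Right;
            if (gap >= minSpaces * spaceWidth)
            {
                // Large gap: close the current column and start a new one.
                columns.Add(current);
                current = ordered[i].Text;
            }
            else
            {
                // Small gap: the chunk continues the current column.
                current += ordered[i].Text;
            }
        }
        columns.Add(current);
        return columns;
    }
}
```

A line that yields more than one column is a candidate table row. The space width has to come from the font in use; iTextSharp's `TextRenderInfo.GetSingleSpaceWidth()` is one possible source for it.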

You could also look at every line in the PDF and try to figure out what represents your "table-like" lines. See <a href="https://stackoverflow.com/a/15638732/231316" rel="nofollow">this answer for how to walk every token on a page</a> to see what's being drawn.
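Once the drawn line segments have been collected by whatever means, picking out the top edge of a table might look like the sketch below. It assumes a hypothetical `Segment` type holding the two end points in page coordinates, with y growing upward as in default PDF user space. `TableRuleFinder`, `FindTableTop`, `minLength`, and `tolerance` are illustrative names, not iTextSharp APIs. It treats the highest sufficiently long horizontal rule as the table start:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical line segment in page coordinates (not an iTextSharp type).
public class Segment
{
    public float X1, Y1, X2, Y2;

    public Segment(float x1, float y1, float x2, float y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    // True when both end points lie at (almost) the same height.
    public bool IsHorizontal(float tolerance)
    {
        return Math.Abs(Y1 - Y2) <= tolerance;
    }

    public float Length
    {
        get { return (float)Math.Sqrt((X2 - X1) * (X2 - X1) + (Y2 - Y1) * (Y2 - Y1)); }
    }
}

public static class TableRuleFinder
{
    // Returns the y-coordinate of the highest horizontal rule that is at least
    // minLength long, or null when no such rule exists.
    public static float? FindTableTop(IEnumerable<Segment> segments, float minLength, float tolerance = 0.5f)
    {
        var ruleHeights = segments
            .Where(s => s.IsHorizontal(tolerance) && s.Length >= minLength)
            .Select(s => Math.Max(s.Y1, s.Y2))
            .ToList();

        return ruleHeights.Count == 0 ? (float?)null : ruleHeights.Max();
    }
}
```

Text whose baseline lies below the returned y-coordinate would then be treated as table content. Note that many PDF producers draw borders as thin filled rectangles rather than stroked lines, so a real page may need those converted to segments first.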

Answer2:

I was also searching for an answer to a similar question, but unfortunately I didn't find one, so I did it on my own.

Here is the GitHub link for the .NET console application I made: <a href="https://github.com/Justabhi96/Detect_And_Extract_Table_From_Pdf" rel="nofollow">https://github.com/Justabhi96/Detect_And_Extract_Table_From_Pdf</a>

This application detects the tables on a specific page of the PDF and prints them in table format on the console. Here is the code that I used to make this application.

First, I extracted the text from the PDF along with its coordinates using a class that extends the <strong>iTextSharp.text.pdf.parser.LocationTextExtractionStrategy</strong> class of iTextSharp. The code is as follows:

This is the class that stores each chunk's text together with its coordinates.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;

    namespace itextPdfTextCoordinates
    {
        public class RectAndText
        {
            public iTextSharp.text.Rectangle Rect;
            public String Text;

            public RectAndText(iTextSharp.text.Rectangle rect, String text)
            {
                this.Rect = rect;
                this.Text = text;
            }
        }
    }

And this is the class that extends the <strong>LocationTextExtractionStrategy</strong> class.

    using iTextSharp.text.pdf.parser;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;

    namespace itextPdfTextCoordinates
    {
        public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
        {
            public List<RectAndText> myPoints = new List<RectAndText>();

            // Automatically called for each chunk of text in the PDF
            public override void RenderText(TextRenderInfo renderInfo)
            {
                base.RenderText(renderInfo);

                // Get the bounding box for the chunk of text
                var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
                var topRight = renderInfo.GetAscentLine().GetEndPoint();

                // Create a rectangle from it
                var rect = new iTextSharp.text.Rectangle(
                    bottomLeft[Vector.I1],
                    bottomLeft[Vector.I2],
                    topRight[Vector.I1],
                    topRight[Vector.I2]
                );

                // Add this to our main collection
                this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
            }
        }
    }

This class overrides the <strong>RenderText</strong> method of the LocationTextExtractionStrategy class, which is called for every chunk when you extract text from a PDF page using the <em>PdfTextExtractor.GetTextFromPage()</em> method.

    using System;
    using itextPdfTextCoordinates;
    using iTextSharp.text.pdf;

    // Create an instance of our strategy
    var t = new MyLocationTextExtractionStrategy();
    var path = "F:\\sample-data.pdf";

    // Parse every page of the document above
    using (var r = new PdfReader(path))
    {
        for (var i = 1; i <= r.NumberOfPages; i++)
        {
            // Calling this function adds all the chunks with their coordinates to the
            // 'myPoints' variable of the 'MyLocationTextExtractionStrategy' class
            var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, i, t);
        }
    }

    // Here you can loop over the chunks of the PDF
    foreach (var chunk in t.myPoints)
    {
        Console.WriteLine("character {0} is at {1}*{2}", chunk.Text, chunk.Rect.Left, chunk.Rect.Top);
    }

Now, to detect the start and end of the table, you can use the coordinates of the chunks extracted from the PDF. <strong>If a line does not contain a table, there will be no jump between the right coordinate of the current chunk and the left coordinate of the next chunk. Lines that contain a table, however, will have coordinate jumps of at least 3 points.</strong>

For example, a line that contains a table will have chunk coordinates something like this:

right coord of current chunk -> 12.75pts<br /> left coord of next chunk -> 20.30pts

You can then use this logic to detect tables in the PDF. The code is as follows:

    using itextPdfTextCoordinates;
    using iTextSharp.text.pdf;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace ConsoleApp1
    {
        class LineUsingCoordinates
        {
            public static List<List<string>> getLineText(string path, int page, float[] coord)
            {
                // Create an instance of our strategy
                var t = new MyLocationTextExtractionStrategy();

                // Parse the requested page of the document
                using (var r = new PdfReader(path))
                {
                    // Calling this function adds all the chunks with their coordinates to the
                    // 'myPoints' variable of the 'MyLocationTextExtractionStrategy' class
                    var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, page, t);
                }

                // List of columns in one line
                List<string> lineWord = new List<string>();
                // Temporary list used when appending a row to the List<List<string>>
                List<string> tempWord;
                // List of rows; each row is a list of strings
                List<List<string>> lineText = new List<List<string>>();
                // List holding the list of chunks that belong to each line
                List<List<RectAndText>> lineChunksList = new List<List<RectAndText>>();
                // List holding the chunks of the line currently being collected
                List<RectAndText> chunksList;
                // List of the bottom coordinates of the lines present on the page
                List<float> bottomPointList = new List<float>();

                // Get the coordinates of the lines on the page, whether they belong to a table or not
                foreach (var i in t.myPoints)
                {
                    Console.WriteLine("character {0} is at {1}*{2}", i.Text, i.Rect.Left, i.Rect.Top);

                    // If coordinates were passed in, only process the chunks inside that
                    // region of the page; otherwise process the whole page
                    bool inRegion = coord == null
                        || (i.Rect.Left >= coord[0] && i.Rect.Bottom >= coord[1]
                            && i.Rect.Right <= coord[2] && i.Rect.Top <= coord[3]);

                    if (inRegion)
                    {
                        float bottom = i.Rect.Bottom;
                        if (bottomPointList.Count == 0 || Math.Abs(bottomPointList.Last() - bottom) > 3)
                        {
                            bottomPointList.Add(bottom);
                        }
                    }
                }

                // Sometimes the list above contains elements that come from the same line but
                // have different coordinates because of characters such as " ", ".", etc.
                // Their bottom coordinates differ by at most 4 points.
                // To remove them, we build two lists of elements to drop from the original list.

                // Elements whose coordinates differ only slightly from an earlier element
                List<float> removeList = new List<float>();
                // Elements whose coordinates are identical to an earlier element
                List<float> sameList = new List<float>();

                for (var i = 0; i < bottomPointList.Count; i++)
                {
                    var basePoint = bottomPointList[i];
                    for (var j = i + 1; j < bottomPointList.Count; j++)
                    {
                        var comparePoint = bottomPointList[j];
                        // Elements with the same coordinates
                        if (Math.Abs(comparePoint - basePoint) == 0)
                        {
                            sameList.Add(comparePoint);
                        }
                        // Elements that differ by less than 4 points
                        else if (Math.Abs(comparePoint - basePoint) < 4)
                        {
                            removeList.Add(comparePoint);
                        }
                    }
                }

                // Remove every element that matches an entry of removeList
                bottomPointList = bottomPointList.Where(item => !removeList.Contains(item)).ToList();
                // Remove the first matching element for each entry of sameList
                foreach (var r in sameList)
                {
                    bottomPointList.Remove(r);
                }

                // Collect the chunks of each line in 'chunksList'
                foreach (var bottomPoint in bottomPointList)
                {
                    chunksList = new List<RectAndText>();
                    for (int i = 0; i < t.myPoints.Count; i++)
                    {
                        // A chunk belongs to this line if its bottom coordinate is the same or
                        // differs by less than 3 points, because the next line will differ by
                        // at least 10 points from the current line
                        if (Math.Abs(t.myPoints[i].Rect.Bottom - bottomPoint) < 3)
                        {
                            chunksList.Add(t.myPoints[i]);
                        }
                    }
                    // Add the chunks that belong to this line
                    lineChunksList.Add(chunksList);
                }

                // Loop through the lines, each holding the chunks that belong to it
                foreach (var linechunk in lineChunksList)
                {
                    var text = "";
                    // Loop through the chunks of the line and split the text wherever there is
                    // a jump in the horizontal coordinates, because only lines that contain a
                    // table have such jumps; lines of plain text do not
                    for (var i = 0; i < linechunk.Count - 1; i++)
                    {
                        // A jump of less than 3 points means the next chunk is in the same
                        // column; otherwise the next chunk belongs to a different column
                        if (Math.Abs(linechunk[i].Rect.Right - linechunk[i + 1].Rect.Left) < 3)
                        {
                            if (i == linechunk.Count - 2)
                            {
                                text += linechunk[i].Text + linechunk[i + 1].Text;
                            }
                            else
                            {
                                text += linechunk[i].Text;
                            }
                        }
                        else
                        {
                            if (i == linechunk.Count - 2)
                            {
                                // Add the text to the column, then add the last chunk
                                // as a column of its own
                                text += linechunk[i].Text;
                                // 'lineWord' is the list of columns, in other words the row
                                lineWord.Add(text);
                                text = "";
                                text += linechunk[i + 1].Text;
                                lineWord.Add(text);
                                text = "";
                            }
                            else
                            {
                                text += linechunk[i].Text;
                                lineWord.Add(text);
                                text = "";
                            }
                        }
                    }
                    if (text.Trim() != "")
                    {
                        lineWord.Add(text);
                    }

                    // Copy the row into a new list before adding it to the list of rows,
                    // because 'lineWord' is cleared and reused for the next line
                    tempWord = new List<string>();
                    tempWord.AddRange(lineWord);
                    lineText.Add(tempWord);
                    lineWord.Clear();
                }
                return lineText;
            }
        }
    }

You can call the <strong>getLineText()</strong> method of the above class and run the following loop to see the output in a table structure on the console.

    var testFile = "F:\\sample-data.pdf";
    // {LowerLeftX, LowerLeftY, UpperRightX, UpperRightY}
    float[] limitCoordinates = { 52, 671, 357, 728 };

    // This call returns the list of rows, each consisting of one or more columns.
    // If you pass null as the third parameter, it returns the content of the whole page;
    // if you pass coordinates, it returns only the content inside those coordinates.
    var lineText = LineUsingCoordinates.getLineText(testFile, 1, null);
    //var lineText = LineUsingCoordinates.getLineText(testFile, 1, limitCoordinates);

    // To detect the table, we use the fact that a 'lineText' item with fewer than two
    // elements is surely not part of the table, while an item with two or more
    // elements is part of the table
    foreach (var row in lineText)
    {
        if (row.Count > 1)
        {
            for (var col = 0; col < row.Count; col++)
            {
                string trimmedValue = row[col].Trim();
                if (trimmedValue != "")
                {
                    Console.Write("|" + trimmedValue + "|");
                }
            }
            Console.WriteLine("");
        }
    }
    Console.ReadLine();
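Since the original goal was a CSV file, the rows returned by a method like getLineText() could be turned into CSV roughly as in the sketch below. It assumes the table is the first run of consecutive rows with more than one column. `TableToCsv`, `FindTableStart`, `Escape`, and `ToCsv` are hypothetical helper names, and the quoting follows the usual RFC 4180 convention rather than anything specific to iTextSharp:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TableToCsv
{
    // Returns the index of the first row with more than one column, or -1 if there is none.
    public static int FindTableStart(List<List<string>> rows)
    {
        return rows.FindIndex(r => r.Count > 1);
    }

    // Trims a field and quotes it when it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        var value = field.Trim();
        if (value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
        {
            return value;
        }
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Emits one CSV line per row for the consecutive multi-column rows
    // that begin at the detected table start.
    public static string ToCsv(List<List<string>> rows)
    {
        var start = FindTableStart(rows);
        if (start < 0)
        {
            return "";
        }

        var sb = new StringBuilder();
        foreach (var row in rows.Skip(start).TakeWhile(r => r.Count > 1))
        {
            sb.AppendLine(string.Join(",", row.Select(f => Escape(f))));
        }
        return sb.ToString();
    }
}
```

The result of `TableToCsv.ToCsv(lineText)` could then be written out with `System.IO.File.WriteAllText`. A page with several tables, or a table containing a single-column row, would need a less naive rule for where the table ends.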
