
How to extract page number from PDF file [closed]

We explored several APIs, such as Tika, PDFBox and itextpdf, to extract the page number from a PDF file, but we were not able to do it. In itextpdf we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform.

Answer1:

The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.

Allow me to predict your response.

"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"

Well yes, you can see that page number with your eyes and your <strong>human</strong> intelligence, but to a <strong>machine</strong> that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs and lines and shapes on a page are about. Hence, software cannot give you the page number you see as a human. A machine doesn't know where to look!

If you know something about PDF, I can predict your next reply.

"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"

Well yes, when a PDF is tagged, a snippet of text knows that it is part of a title, or a paragraph, or a list, and so on. But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content. They are marked as artifacts, along with headers, footers and other items on a page that are not considered real content. There is no way to distinguish page numbers from those other artifacts.

"Then what are these page labels about?" you ask.

Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.
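
Because page labels are optional, code that wants "a number for each page" needs a fallback for when they are missing. Here is a minimal sketch of that idea, assuming the labels arrive as a String array that may be null (in iText 5 that array would come from the PdfPageLabels.getPageLabels(reader) call mentioned in the question, which as far as I know returns null when the document defines no labels). The class and method names are hypothetical, and the fallback is only the physical page index, not the number printed on the page.

```java
import java.util.Arrays;

// Hypothetical helper, not part of iText, PDFBox or Tika.
final class PageLabelFallback {

    // labels: result of a page-label lookup; assumed to be null (or too short)
    // when the PDF defines no page labels.
    // pageCount: number of physical pages in the document.
    static String[] labelsOrPhysical(String[] labels, int pageCount) {
        String[] result = new String[pageCount];
        for (int i = 0; i < pageCount; i++) {
            boolean hasLabel = labels != null
                    && i < labels.length
                    && labels[i] != null
                    && !labels[i].isEmpty();
            // Fall back to the 1-based physical page index when no label exists.
            result[i] = hasLabel ? labels[i] : String.valueOf(i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // No labels defined: every page gets its physical index.
        System.out.println(Arrays.toString(labelsOrPhysical(null, 3)));
        // Labels defined: they are used as-is.
        System.out.println(Arrays.toString(
                labelsOrPhysical(new String[] {"i", "ii", "1"}, 3)));
    }
}
```

Neither value is guaranteed to match the number a human reads in the header or footer, which is exactly the point of this answer.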

This is the long answer. The short answer is simple: <strong>You are asking for something that is impossible</strong> (in general, not only with iText, Tika, PdfBox, or any other tool you might try).
