What to do with CIDs in text extracted by PDFMiner?

Question:

I have some PDFs that are in Hindi and contain extractable text. I used <a href="https://github.com/pdfminer/pdfminer.six" rel="nofollow">pdfminer.six</a> with Python 3.6 to do the extraction. The output looks like this:<br /><a href="https://i.stack.imgur.com/cZGKz.png" rel="nofollow"><img alt="enter image description here" src="https://i.stack.imgur.com/cZGKz.png" /></a>

As one can see, a number of characters are converted into the form "(cid:number)".

On further analysis, I found out that a PDF contains CMaps which map character codes to glyph indices. So a CID is a character identifier for the glyph it maps to inside the CMap table.

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

Moreover, according to a comment on <a href="https://stackoverflow.com/q/22908556/5345646" rel="nofollow">this</a> similar question, this process may not be legal. But I'm not trying to steal someone's font. I want the text. How does this process become illegal?

Since there are many questions like this one, I want to clarify that I am not trying to solve the "cid" problem here. I want to understand the reasons for the problem and the reasons for its illegality.

<strong>EDIT:</strong> <a href="https://github.com/euske/pdfminer/issues/39" rel="nofollow">This <em>issues</em> page</a> for pdfminer discusses the problem, and the author clearly says there that there seems to be no reliable workaround. Is there some general, basic limitation (like no access to the font) that makes this problem persist?

Answer1:

<blockquote>

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

</blockquote>

The character codes one finds in the PDF content streams do not need to be related to Unicode values in any obvious way. In particular, a PDF viewer does not at all need a Unicode code point for a character code to show the matching glyph.

In a PDF a font has a mapping (or a sequence of mappings) from character code to glyph ID in the font program, and this mapping may be completely arbitrary.

E.g. in the case of embedded font subsets, the subset font program is often created by giving the first glyph used on a page a starting glyph id <em>n</em>, the second, different glyph on that page the id <em>n+1</em>, the next different glyph the id <em>n+2</em>, and so on. The character codes are then often identical to the glyph ids, i.e. the mapping above is the identity mapping. If there is no additional information, a text extractor has no chance of doing its job properly.
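To make the subsetting scheme above concrete, here is a minimal sketch in plain Python. It is not taken from any real PDF producer: the function name `build_subset` and the starting id are assumptions for illustration, and one character is treated as one glyph, which does not hold for shaped scripts such as Devanagari.

```python
def build_subset(text, first_glyph_id=1):
    """Hypothetical subsetter: number glyphs in order of first appearance."""
    glyph_ids = {}  # character -> glyph id inside the subset font
    codes = []      # character codes as they would be written to the content stream
    for ch in text:
        if ch not in glyph_ids:
            # The next unused id goes to each glyph not seen before.
            glyph_ids[ch] = first_glyph_id + len(glyph_ids)
        # Identity mapping: the character code is simply the glyph id.
        codes.append(glyph_ids[ch])
    return glyph_ids, codes


ids_a, codes_a = build_subset("hello")
ids_b, codes_b = build_subset("world")

print(codes_a)                 # codes for "hello" in the first subset
print(codes_b)                 # codes for "world" in the second subset
print(ids_a["o"], ids_b["o"])  # the same letter gets a different code in each subset
```

In this sketch the letter "o" ends up with a different code in each subset, so the numbers in the content stream say nothing about the character unless the PDF carries extra mapping information.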

<blockquote>

I want to clarify the reasons for the problem

</blockquote>

Regular text extraction usually has the following options to find the Unicode value for a character code:

<ul><li>

A PDF font <em>may</em> include a <strong>ToUnicode</strong> map (mapping from character code to Unicode) to support operations like searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor needs.

Beware, though: these <strong>ToUnicode</strong> maps can be incomplete and sometimes even contain intentionally incorrect mappings!

</li> <li>

The PDF font encoding definition may contain the name of a pre-defined standard encoding (e.g. <strong>WinAnsiEncoding</strong> or <strong>GBpc-EUC-H</strong>) or a standardized character name (e.g. <strong>space</strong>, <strong>seven</strong>, or <strong>ntilde</strong>) for a given code. A text extractor merely needs to know the encoding represented by that encoding name or the character represented by that character name.

But the <strong>Encoding</strong> may also be an identity (<strong>Identity-H</strong> and <strong>Identity-V</strong> with <em>character code = glyph code</em>) which doesn't give away anything, and a character name may also be non-standardized (e.g. <strong>g17</strong>).

</li> </ul>
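The lookup order described in the list above can be sketched as follows. This is only an illustration with hand-made dictionaries, not how pdfminer or any other library is implemented: `code_to_text`, `to_unicode`, `code_to_name` and the tiny `GLYPH_NAME_TO_UNICODE` table are all hypothetical names, and the fallback string merely imitates the "(cid:number)" placeholder from the question.

```python
# Tiny, hand-picked excerpt of standardized glyph names (assumption for illustration).
GLYPH_NAME_TO_UNICODE = {"space": " ", "seven": "7", "ntilde": "\u00f1"}


def code_to_text(code, to_unicode=None, code_to_name=None):
    """Resolve one character code to text, trying the options in order."""
    # Option 1: a ToUnicode map, if the font has one and it covers this code.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # Option 2: a standardized glyph name from the font's encoding.
    name = (code_to_name or {}).get(code)
    if name in GLYPH_NAME_TO_UNICODE:
        return GLYPH_NAME_TO_UNICODE[name]
    # Nothing worked: emit a placeholder instead of guessing a character.
    return "(cid:%d)" % code


print(code_to_text(3, to_unicode={3: "\u0915"}))     # covered by ToUnicode
print(code_to_text(5, code_to_name={5: "seven"}))    # standardized glyph name
print(code_to_text(17, code_to_name={17: "g17"}))    # non-standard name, unresolved
```

The last call shows the failure case: a code that is missing from the ToUnicode map and only has a non-standard name such as "g17" cannot be resolved from the PDF metadata alone.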

The PDF specification says: <em>If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.</em>

In case of your text extraction output I would guess the PDF font has an incomplete <strong>ToUnicode</strong> map.
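One way to get a feel for how incomplete the mapping is would be to count the placeholders in the already extracted text. The following is a small diagnostic sketch, assuming the placeholders have the form "(cid:number)"; the function name `unresolved_cids` and the sample string are made up for illustration.

```python
import re
from collections import Counter

# Matches placeholders such as "(cid:12)" and captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def unresolved_cids(extracted_text):
    """Count how often each unresolved CID occurs in the extracted text."""
    return Counter(int(number) for number in CID_PATTERN.findall(extracted_text))


sample = "\u0915(cid:12)\u0932 (cid:7)(cid:12)"  # made-up extraction output
counts = unresolved_cids(sample)
print(counts.most_common())
```

A short list of distinct CIDs that recur throughout the text would be consistent with a font whose map covers most codes but misses a few.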

Actually, there are some more places to look for additional information, e.g. the font program might include its own mapping of its glyphs to Unicode, but that additional information is also optional.

<blockquote>

... and reasons for its illegality.

</blockquote>

With all of the above options I don't see any sensible font license being violated, in particular as most of those options don't even look into the font program (e.g. the *.ttf) itself, merely into the PDF metadata wrapping it.

On the other hand, suppose you had the idea to construct <strong>ToUnicode</strong> maps for fonts missing such a map by drawing each glyph of the font onto a bitmap, nicely separated from anything else, and applying OCR to it. Then you, as the recipient of the PDF, would suddenly be using the font program to draw something other than the original document, and this might be considered usage not covered by the license.
