Determining whether text can be extracted from a PDF

Time: 2013-09-12 11:10:29

Tags: pdf encoding fonts text-extraction

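I have many PDF files. In some of them I can easily copy and paste text from the PDF into any text editor. In others, copy/paste produces only garbage (strange, unreadable characters). As far as I understand, this is caused by embedded fonts and/or custom encodings (but I may be wrong).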

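I picked 10 PDFs and ran pdffonts on them to extract font-related information. Text can be copied from the PDFs whose names start with c (correct); it cannot be copied from those whose names start with w (wrong). The output of the pdffonts command is shown below.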

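Can I identify the bad documents by their custom encoding? In other words, does the presence of a custom encoding mean that text cannot be copied and pasted from the PDF?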

./comparison/c1.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DDDWSC+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes      9  
DDDWSC+MyriadPro-Bold                CID Type 0C       Identity-H       yes yes yes     18  
XPQSAJ+MinionPro-Regular             CID Type 0C       Identity-H       yes yes yes     36  
QQNHBI+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes    121  
MyriadPro-Regular                    Type 1C (OT)      WinAnsi          yes no  no      82  

./comparison/c2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GBITER+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes      9  
GBITER+MyriadPro-Bold                CID Type 0C       Identity-H       yes yes yes     18  
TPIJNO+MinionPro-Regular             CID Type 0C       Identity-H       yes yes yes     36  
HCPLUP+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes     99  
CFAHCZ+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes    100  
MyriadPro-Regular                    Type 1C (OT)      WinAnsi          yes no  no      82  

./comparison/c3.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
FTWOKY+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes      8  
FTWOKY+MyriadPro-Bold                CID Type 0C       Identity-H       yes yes yes      9  
HDAKMN+MinionPro-Regular             CID Type 0C       Identity-H       yes yes yes     34  
CYRRXP+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes    119  
MyriadPro-Regular                    Type 1C (OT)      WinAnsi          yes no  no      80  

./comparison/c4.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        CID TrueType      Identity-H       yes no  yes      8  
TimesNewRoman,Bold                   CID TrueType      Identity-H       yes no  yes      9  
TimesNewRoman,BoldItalic             CID TrueType      Identity-H       yes no  yes     30  
TimesNewRomanPSMT                    TrueType          WinAnsi          no  no  no      10  
TimesNewRomanPS-BoldMT               TrueType          WinAnsi          no  no  no      31  
TimesNewRomanPS-BoldItalicMT         TrueType          WinAnsi          no  no  no      32  
Arial-BoldItalicMT                   TrueType          WinAnsi          no  no  no      33  
CPWIYN+MinionPro-Regular             CID Type 0C       Identity-H       yes yes yes     56  
PZAZAE+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes    120  
MyriadPro-Regular                    Type 1C (OT)      WinAnsi          yes no  no     102  

./comparison/c5.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        CID TrueType      Identity-H       yes no  yes      9  
TimesNewRoman,Bold                   CID TrueType      Identity-H       yes no  yes     10  
TimesNewRomanPSMT                    TrueType          WinAnsi          no  no  no      11  
TimesNewRomanPS-BoldMT               TrueType          WinAnsi          no  no  no      12  
PKLOUG+MinionPro-Regular             CID Type 0C       Identity-H       yes yes yes     43  
ZWNFNP+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes    120  
MyriadPro-Regular                    Type 1C (OT)      WinAnsi          yes no  no      89  

./comparison/w1.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ECCDLD+TimesNewRomanPSMT             Type 1C           WinAnsi          yes yes no       5  
ECCDMD+TimesNewRoman                 Type 1C           Custom           yes yes no       6  
ECCDNE+TimesNewRomanPS-BoldMT        Type 1C           WinAnsi          yes yes no       7  
ECCDNF+TimesNewRoman,Bold            Type 1C           Custom           yes yes no       8  
MinionPro-Regular-Identity-H         CID Type 0C       Identity-H       yes no  no      24  

./comparison/w2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DIKJDI+TimesNewRoman,Bold            Type 1C           Custom           yes yes no       5  
DIKJEJ+TimesNewRomanPS-BoldMT        Type 1C           WinAnsi          yes yes no       6 0
DIKJEK+TimesNewRomanPSMT             Type 1C           WinAnsi          yes yes no       7  
DIKJEL+TimesNewRoman                 Type 1C           Custom           yes yes no       8  
MinionPro-Regular-Identity-H         CID Type 0C       Identity-H       yes no  no      22  

./comparison/w3.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
LLHACL+Calibri                       Type 1C           Custom           yes yes yes      5  
LLHACM+Calibri-Bold                  Type 1C           Custom           yes yes yes      6  
LLHBBI+Calibri-Italic                Type 1C           Custom           yes yes yes     20  
MinionPro-Regular-Identity-H         CID Type 0C       Identity-H       yes no  no      21  

./comparison/w4.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
EPGNDG+TimesNewRoman                 Type 1C           Custom           yes yes no       5
EPGNDH+TimesNewRomanPSMT             Type 1C           WinAnsi          yes yes no       6  
EPGNDI+TimesNewRomanPS-BoldMT        Type 1C           WinAnsi          yes yes no       7  
EPGNGI+TimesNewRoman,Bold            Type 1C           Custom           yes yes no       8  
MinionPro-Regular-Identity-H         CID Type 0C       Identity-H       yes no  no      19  
OXKXLW+MyriadPro-Regular             CID Type 0C       Identity-H       yes yes yes     60  
MyriadPro-Regular                    Type 1C           WinAnsi          yes no  no      52  

./comparison/w5.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JPDEFN+TimesNewRoman                 Type 1C           Custom           yes yes no       5  
JPDEHN+TimesNewRomanPSMT             Type 1C           WinAnsi          yes yes no       6  
JPDEIN+TimesNewRomanPS-BoldMT        Type 1C           WinAnsi          yes yes no       7  
JPDEJO+TimesNewRoman,Bold            Type 1C           Custom           yes yes no       8
MinionPro-Regular-Identity-H         CID Type 0C       Identity-H       yes no  no      25
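
To check this hypothesis against the listings above, here is a minimal Python sketch. It parses pdffonts output and flags fonts whose encoding is Custom, or whose encoding is Identity-H/Identity-V while the uni column is no. The names FontRow, parse_pdffonts, suspicious_fonts and run_pdffonts are made up for illustration. The rule itself is only an assumption that happens to separate these ten samples (every w file contains such a font, no c file does); it is not a guarantee that text can or cannot be extracted from other PDFs.

```python
import re
import subprocess
from dataclasses import dataclass
from typing import List

# One pdffonts table row: name, type, encoding, emb, sub, uni, object ID.
# Assumption: font names and encoding names contain no spaces; the type
# column may ("CID Type 0C", "Type 1C (OT)"). A trailing generation
# number after the object number is tolerated but not required.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.+?)\s+(?P<encoding>\S+)"
    r"\s+(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)"
    r"\s+(?P<obj>\d+)(?:\s+\d+)?\s*$"
)


@dataclass
class FontRow:
    name: str
    type: str
    encoding: str
    emb: bool
    sub: bool
    uni: bool


def parse_pdffonts(text: str) -> List[FontRow]:
    """Parse pdffonts output; header, separator and path lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue
        rows.append(FontRow(
            name=m["name"],
            type=m["type"],
            encoding=m["encoding"],
            emb=m["emb"] == "yes",
            sub=m["sub"] == "yes",
            uni=m["uni"] == "yes",
        ))
    return rows


def suspicious_fonts(rows: List[FontRow]) -> List[FontRow]:
    """Heuristic (an assumption, not a rule from any specification):
    flag Custom encodings, and Identity-* encodings with uni == no."""
    return [
        r for r in rows
        if r.encoding == "Custom"
        or (r.encoding.startswith("Identity") and not r.uni)
    ]


def run_pdffonts(path: str) -> str:
    """Run the pdffonts command (must be on PATH) and return its output."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With pdffonts installed, something like `suspicious_fonts(parse_pdffonts(run_pdffonts("./comparison/w1.pdf")))` would list the fonts the heuristic objects to. A document with no flagged fonts is only "probably fine" under this assumption, so spot-checking the extracted text is still worthwhile.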

0 answers:

No answers