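I have a lot of PDF files. From some of them I can easily copy and paste text into any text editor. From others, copy/paste produces only garbage (strange, unreadable characters). As I understand it, this is caused by embedded fonts and/or custom encodings (but I may be wrong).
I picked 10 PDFs and ran pdffonts on them to extract font-related information. Text can be copied from the PDFs whose names start with c (correct); it cannot be copied from those whose names start with w (wrong). The output of the pdffonts command is below.
Can I identify the wrong documents by their custom encoding? In other words, if a font uses a custom encoding, is it impossible to copy/paste text from the PDF?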
./comparison/c1.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DDDWSC+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 9
DDDWSC+MyriadPro-Bold CID Type 0C Identity-H yes yes yes 18
XPQSAJ+MinionPro-Regular CID Type 0C Identity-H yes yes yes 36
QQNHBI+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 121
MyriadPro-Regular Type 1C (OT) WinAnsi yes no no 82
./comparison/c2.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GBITER+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 9
GBITER+MyriadPro-Bold CID Type 0C Identity-H yes yes yes 18
TPIJNO+MinionPro-Regular CID Type 0C Identity-H yes yes yes 36
HCPLUP+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 99
CFAHCZ+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 100
MyriadPro-Regular Type 1C (OT) WinAnsi yes no no 82
./comparison/c3.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
FTWOKY+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 8
FTWOKY+MyriadPro-Bold CID Type 0C Identity-H yes yes yes 9
HDAKMN+MinionPro-Regular CID Type 0C Identity-H yes yes yes 34
CYRRXP+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 119
MyriadPro-Regular Type 1C (OT) WinAnsi yes no no 80
./comparison/c4.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman CID TrueType Identity-H yes no yes 8
TimesNewRoman,Bold CID TrueType Identity-H yes no yes 9
TimesNewRoman,BoldItalic CID TrueType Identity-H yes no yes 30
TimesNewRomanPSMT TrueType WinAnsi no no no 10
TimesNewRomanPS-BoldMT TrueType WinAnsi no no no 31
TimesNewRomanPS-BoldItalicMT TrueType WinAnsi no no no 32
Arial-BoldItalicMT TrueType WinAnsi no no no 33
CPWIYN+MinionPro-Regular CID Type 0C Identity-H yes yes yes 56
PZAZAE+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 120
MyriadPro-Regular Type 1C (OT) WinAnsi yes no no 102
./comparison/c5.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman CID TrueType Identity-H yes no yes 9
TimesNewRoman,Bold CID TrueType Identity-H yes no yes 10
TimesNewRomanPSMT TrueType WinAnsi no no no 11
TimesNewRomanPS-BoldMT TrueType WinAnsi no no no 12
PKLOUG+MinionPro-Regular CID Type 0C Identity-H yes yes yes 43
ZWNFNP+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 120
MyriadPro-Regular Type 1C (OT) WinAnsi yes no no 89
./comparison/w1.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ECCDLD+TimesNewRomanPSMT Type 1C WinAnsi yes yes no 5
ECCDMD+TimesNewRoman Type 1C Custom yes yes no 6
ECCDNE+TimesNewRomanPS-BoldMT Type 1C WinAnsi yes yes no 7
ECCDNF+TimesNewRoman,Bold Type 1C Custom yes yes no 8
MinionPro-Regular-Identity-H CID Type 0C Identity-H yes no no 24
./comparison/w2.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DIKJDI+TimesNewRoman,Bold Type 1C Custom yes yes no 5
DIKJEJ+TimesNewRomanPS-BoldMT Type 1C WinAnsi yes yes no 6 0
DIKJEK+TimesNewRomanPSMT Type 1C WinAnsi yes yes no 7
DIKJEL+TimesNewRoman Type 1C Custom yes yes no 8
MinionPro-Regular-Identity-H CID Type 0C Identity-H yes no no 22
./comparison/w3.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
LLHACL+Calibri Type 1C Custom yes yes yes 5
LLHACM+Calibri-Bold Type 1C Custom yes yes yes 6
LLHBBI+Calibri-Italic Type 1C Custom yes yes yes 20
MinionPro-Regular-Identity-H CID Type 0C Identity-H yes no no 21
./comparison/w4.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
EPGNDG+TimesNewRoman Type 1C Custom yes yes no 5
EPGNDH+TimesNewRomanPSMT Type 1C WinAnsi yes yes no 6
EPGNDI+TimesNewRomanPS-BoldMT Type 1C WinAnsi yes yes no 7
EPGNGI+TimesNewRoman,Bold Type 1C Custom yes yes no 8
MinionPro-Regular-Identity-H CID Type 0C Identity-H yes no no 19
OXKXLW+MyriadPro-Regular CID Type 0C Identity-H yes yes yes 60
MyriadPro-Regular Type 1C WinAnsi yes no no 52
./comparison/w5.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JPDEFN+TimesNewRoman Type 1C Custom yes yes no 5
JPDEHN+TimesNewRomanPSMT Type 1C WinAnsi yes yes no 6
JPDEIN+TimesNewRomanPS-BoldMT Type 1C WinAnsi yes yes no 7
JPDEJO+TimesNewRoman,Bold Type 1C Custom yes yes no 8
MinionPro-Regular-Identity-H CID Type 0C Identity-H yes no no 25
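
To test the hypothesis on more than 10 files, the check could be automated roughly as follows. This is only a sketch in Python: the helper names `parse_report` and `custom_encoded` are made up for illustration, and it assumes the report looks like the listing above (a path line ending in `.pdf`, followed by the pdffonts table for that file). It only reports which fonts have the encoding `Custom`; whether that really predicts broken copy/paste is exactly what I am unsure about.

```python
import re
from collections import defaultdict

# One font row: name, type (may contain spaces, e.g. "CID Type 0C"),
# encoding, the three yes/no flags (emb, sub, uni), then the object ID.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.+?)\s+(?P<encoding>\S+)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\s+"
    r"(?P<obj>\d+(?:\s+\d+)?)\s*$"
)


def parse_report(text):
    """Group font rows by the '*.pdf' path line that precedes them.

    Assumption: every file section starts with a line ending in '.pdf'.
    Header and dashed separator lines do not match ROW and are skipped.
    """
    fonts = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.lower().endswith(".pdf"):
            current = line
            fonts[current]  # register the file even if no font rows follow
            continue
        match = ROW.match(line)
        if match and current is not None:
            fonts[current].append(match.groupdict())
    return dict(fonts)


def custom_encoded(fonts_by_file):
    """Map each file to the names of its fonts whose encoding is 'Custom'."""
    return {
        path: [f["name"] for f in rows if f["encoding"] == "Custom"]
        for path, rows in fonts_by_file.items()
    }
```

Calling `custom_encoded(parse_report(report_text))`, where `report_text` holds the listing above, gives one entry per PDF; an empty list means no font in that file has the encoding `Custom`.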