Question

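On Unix or Windows, I want to convert this dictionary into a Python dictionary. I copied the contents of the PDF dictionary into an .rtf file, planning to read it with Python. However, the result looks like this: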

A /e/名词是ABO系统的人血型，含有A抗原（注意：A型的一个人可以捐赠给同一组或AB组的人，并且可以接受   来自A型或O型人的血液。）
  AA
  腹胀/bdɒmn（ə）ldstenʃ（ə）n /名词一个条件，其中abdo-
  男人因气体或液体而伸展   一个
  腹胀   AA abbr嗜酒者匿名

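It basically squashes the PDF's columns together into a jumble. How can I convert the PDF to text so that the columns are respected? In other words, the desired output is: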

A /e/名词是ABO系统的人血型，含有A抗原（注意：A型的一个人可以捐赠给同一组或AB组的人，并且可以接受   来自A型或O型人的血液。）
  AA abbr嗜酒者匿名

......等等

Answer 1

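You basically have two options for getting at the text:

Extract the text directly from each page as it is.
Split each page in two along the gap between the columns, and extract the text from each half separately.

For the first option, I suggest you try pdftotext first, but with the -layout parameter. (There are other tools too, such as TET, the Text Extraction Toolkit from the PDFlib people, which you can try if pdftotext is not good enough.)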
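Since the goal is to end up in Python anyway, here is a minimal sketch of driving pdftotext from Python with the standard subprocess module. The function names are hypothetical. The flags (-f, -l, -layout, and - for standard output) are pdftotext's own, and the tool is assumed to be installed and on the PATH.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first_page, last_page):
    """Build the argument list for a layout-preserving extraction to stdout."""
    return [
        "pdftotext",
        "-f", str(first_page),  # first page to convert
        "-l", str(last_page),   # last page to convert
        "-layout",              # keep the physical layout of the page
        pdf_path,
        "-",                    # write the text to stdout instead of a file
    ]


def extract_layout_text(pdf_path, first_page, last_page):
    """Run pdftotext and return the extracted text as a string."""
    cmd = build_pdftotext_cmd(pdf_path, first_page, last_page)
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_layout_text("dictionary.pdf", 8, 8), with a placeholder file name, would return the text of page 8 with both columns still side by side.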

To go down the road of the second option, using Ghostscript and other tools, you may want to look at my answers to the following questions:

Linux-based tool to chop PDFs into multiple pages (Super User)
Convert PDF 2 sides per page to 1 side per page (Super User)
How can I split a PDF's pages down the middle? (Super User)
Cropping a PDF using Ghostscript 9.01 (Stack Overflow)
Split one PDF page into two (Stack Overflow)
PDF - Remove White Margins (Stack Overflow)
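If you would rather not crop the PDF itself, a related technique (not the Ghostscript approach from the linked answers) is to let pdftotext crop at extraction time with its -x, -y, -W and -H options. The sketch below only builds the two command lines. The function name is hypothetical, the page size has to be supplied by you, and it assumes the column gap sits at the horizontal midpoint and that pdftotext runs at its default resolution of 72 DPI, so that crop pixels equal PDF points.

```python
def half_page_commands(pdf_path, page, page_width, page_height):
    """Return two pdftotext argument lists: left half first, then right half.

    page_width and page_height are in points. At the assumed default
    resolution of 72 DPI, one pixel of the crop area equals one point.
    """
    half = int(page_width) // 2
    commands = []
    for x_offset in (0, half):
        commands.append([
            "pdftotext",
            "-f", str(page), "-l", str(page),
            "-x", str(x_offset),          # left edge of the crop area
            "-y", "0",                    # top edge of the crop area
            "-W", str(half),              # width of the crop area
            "-H", str(int(page_height)),  # height of the crop area
            pdf_path,
            "-",                          # write to stdout
        ])
    return commands
```

Each list can be passed to subprocess.run; concatenating the left output and the right output gives the page in reading order, provided the midpoint assumption holds for your file.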

`pdftotext -layout`

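You can give the command-line tool pdftotext a try. You will have to decide for yourself whether it is "good enough" for your purposes.

The following command extracts the text from page 8 only (the first page with a two-column layout) and prints it to <stdout>: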

$ pdftotext -f 8 -l 8 -layout                                         \
           Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - \
 | head -n 30

Result:

Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM

                                                          A
 A /e/ noun a human blood type of the ABO                abdominal distension /bdɒmn(ə)l ds
 A                                                        abdominal distension
 system, containing the A antigen (NOTE: Some-              tenʃ(ə)n/ noun a condition in which the abdo-
 one with type A can donate to people of the              men is stretched because of gas or fluid
 same group or of the AB group, and can receive           abdominal pain /b dɒmn(ə)l pen/ noun
                                                          abdominal pain
 blood from people with type A or type O.)                pain in the abdomen caused by indigestion or
 AA
 AA abbr Alcoholics Anonymous                             more serious disorders
 A & E /e ənd  i
                     /, A & E department /e ənd           abdominal viscera /bdɒmn(ə)l    vsərə/
 A & E                                                    abdominal viscera
    i
      d pɑ
           tmənt/ noun same as accident and
                                                          plural noun the organs which are contained in
 emergency department                                     the abdomen, e.g. the stomach, liver and intes-
 A & E medicine /e ənd     i
                              med(ə)sn/
 A & E medicine
                                                          tines
                                                          abdominal wall /b dɒmn(ə)l wɔ
                                                                                        l/ noun
                                                          abdominal wall
 noun the medical procedures used in A & E de-                                                            
 partments                                                muscular tissue which surrounds the abdomen
                                                          abdomino- /bdɒmnəυ/ prefix referring to
                                                          abdomino-

Note the use of -layout! Without it, the extracted text would look like this:

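Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM A A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- A

one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.) AA abbr Alcoholics Anonymous A & E /e ənd i /, A & E department /e ənd i d pɑ tmənt/ noun same as accident and emergency department A & E medicine /e ənd i med(ə)sn/ noun the medical procedures used in A & E de- AA

A & E A & E medicine partments AB /e bi / noun a human blood type of the ABO system, containing the A and B antigens AB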
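Because -layout keeps the two columns side by side on every line, they can be pulled apart afterwards. The following is a rough sketch under stated assumptions: both function names are hypothetical, the page has exactly two columns, and the gap between them lies somewhere in the middle 40% of the line width. Lines that run across the gap, such as the "Medicine.fm Page 1" header, will be cut in two and need separate handling.

```python
def find_gutter(lines, min_fraction=0.3, max_fraction=0.7):
    """Guess the character position where the gap between the columns starts."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    best_col, best_score = 0, -1
    for col in range(int(width * min_fraction), int(width * max_fraction)):
        # Count the lines that are blank at this position and the next one.
        score = sum(1 for line in padded if line[col:col + 2].strip() == "")
        if score > best_score:
            best_col, best_score = col, score
    return best_col


def split_columns(text):
    """Return the left column's lines followed by the right column's lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    gutter = find_gutter(lines)
    left = [line[:gutter].strip() for line in lines]
    right = [line[gutter:].strip() for line in lines]
    # Drop the empty halves of lines that only had text in one column.
    return [part for part in left + right if part]
```

This is only a heuristic: it picks the first position with the most blank lines, so a page whose left column is mostly short lines may be split too early.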

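I noticed that on page 8 the file uses, but does not embed, the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.

This does not cause a problem for text extraction, because all of these fonts use /WinAnsiEncoding.

However, there are other fonts that are embedded as subsets. These fonts use a /Custom encoding, but they do not provide a /ToUnicode table. This table is required for reliable text extraction (translating the glyph names back to character names).

You can see what I mean in this table: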

$ pdffonts -f 8 -l 8 Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf 
 name                           type        encoding      emb sub uni object ID
 ------------------------------ ----------- ------------- --- --- --- ---------
 Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
 Courier                        Type 1      WinAnsi       no  no  no    1507  0
 Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
 MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes no    1509  0
 Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
 Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
 IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
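If you want to run this check over many files, the pdffonts output can be scanned from Python. This is a small sketch with a hypothetical function name. It assumes the column layout shown in the table above (name, type, encoding, emb, sub, uni, object ID) and reads each row from the right, because the type column can contain spaces.

```python
def fonts_without_tounicode(pdffonts_output):
    """Return the names of embedded fonts whose 'uni' column says 'no'."""
    risky = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(fields) < 7 or fields[0] == "name" or set(fields[0]) == {"-"}:
            continue
        # The last two fields are the object ID; before them come uni, sub, emb.
        emb, uni = fields[-5], fields[-3]
        if emb == "yes" and uni == "no":
            risky.append(fields[0])
    return risky
```

Fonts that are not embedded at all, such as Helvetica in the table, are deliberately not reported, since their standard encoding is enough for extraction.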

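As it happens, I recently hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for every font that is embedded as a subset. They can be found here, along with a README that explains more of the details:

https://github.com/angea/PDF101/tree/master/handcoded/textextract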

Answer 2

You can use pdfminer to extract text from a PDF: http://www.unixuser.org/~euske/python/pdfminer/
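As a hedged illustration, this is roughly what it looks like with pdfminer.six, the maintained fork of the package linked above (an assumption on my part; the original pdfminer exposes a different, lower-level interface). The helper names are hypothetical; extract_text and LAParams come from pdfminer.six.

```python
def to_zero_based(pages):
    """pdfminer.six numbers pages from 0, whereas pdftotext numbers them from 1."""
    return [p - 1 for p in pages]


def extract_pages(pdf_path, pages):
    """Extract the text of the given 1-based page numbers with layout analysis."""
    # Imported here so that this module still loads where pdfminer.six is missing.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # LAParams() enables pdfminer's layout analysis with its default settings.
    return extract_text(
        pdf_path,
        page_numbers=to_zero_based(pages),
        laparams=LAParams(),
    )
```

Whether the default layout parameters keep the two dictionary columns apart is something you would have to check on your own file.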

Answer 3

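PDF documents have very little notion of document structure. A PDF content stream consists of instructions for placing glyphs on the page, but the order in which they are placed need not correspond to the structure of the document.

You did not say which platform you are using. If you are on OS X, you can use PDFKit to achieve what you want.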
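To make that concrete, here is a toy sketch of how reading order can be rebuilt from positions rather than from stream order. The Word type and the function are hypothetical. It assumes you already have each word's coordinates from some extraction library, a two-column page split at its midpoint, and PDF-style coordinates in which y grows upwards.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    x: float   # left edge of the word
    y: float   # baseline; larger values are higher on the page
    text: str


def reading_order(words: List[Word], page_width: float) -> List[str]:
    """Sort words by column, then top to bottom, then left to right."""
    middle = page_width / 2

    def key(word: Word):
        column = 0 if word.x < middle else 1
        return (column, -word.y, word.x)

    return [word.text for word in sorted(words, key=key)]
```

The input order does not matter, which is the point: two PDFs that look identical on screen can list their glyphs in completely different sequences.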

Answer 4

I solved this problem with R. The code may have small bugs, which you can fix to suit your needs.

countWhiteSpaces <- function(x)
  attr(gregexpr("(?<=[^ ])[ ]+(?=[^ ])", x, perl = TRUE)[[1]], "match.length")

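getColumnCount <- function(path){
  library(pdftools)
  # pdf_text() returns one string per page; split the pages into single lines
  x <- pdf_text(path)
  res <- unlist(strsplit(x, "\n"))

  # For each line, count the gaps of two or more spaces between words
  yy <- integer(length(res))
  for(i in seq_along(res)){
    y <- countWhiteSpaces(res[i])
    yy[i] <- length(y[y > 1])
  }
  # The most frequent gap count, plus one, is taken as the number of columns
  li <- list(colsInPdf = 1 + as.integer(names(sort(table(yy), decreasing = TRUE)[1])), lines = res)
  return(li)
}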

result <- getColumnCount("pathToPdfFile.pdf")
lines <- result$lines
sizeOfText <- length(lines)
colsInPdf <- result$colsInPdf
df <- data.frame(matrix(ncol = result$colsInPdf, nrow = 0))
df <- df[1,]


for(i in seq(1:sizeOfText)){
  line = lines[i]
  y = as.list(countWhiteSpaces(line))
  yy = length(y[y > 1])
  t = as.list(strsplit(line, '\\s{2,}')[[1]])
  if(t[1]==""){t=t[-1]}
  t = unlist(t)
  if(length(t)==colsInPdf){
    df <- rbind(df, t)
  }

}
df = paste(df,collapse = " ")

Clean_String <- function(string){
  # Lowercase
  temp <- tolower(string)
  # Remove everything that is not a number or letter (may want to keep more 
  # stuff in your actual analyses). 
  temp <- stringr::str_replace_all(temp,"[^a-zA-Z\\s]", " ")
  # Shrink down to just one white space
  temp <- stringr::str_replace_all(temp,"[\\s]+", " ")
  # Split it
  temp <- stringr::str_split(temp, " ")[[1]]
  temp <- gsub(",", " ",temp)
  # Get rid of trailing "" if necessary
  indexes <- which(temp == "")
  if(length(indexes) > 0){
    temp <- temp[-indexes]
  } 
  return(temp)
}

toString(Clean_String(df))
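For anyone who wants to stay in Python, the column-counting idea behind getColumnCount above could be sketched like this. It is a loose port rather than a line-by-line translation: the names are hypothetical, blank lines are skipped, and a "wide gap" is taken to mean a run of two or more spaces between non-space characters.

```python
import re
from collections import Counter

# Two or more spaces with a non-space character on either side.
WIDE_GAP = re.compile(r"(?<=\S) {2,}(?=\S)")


def count_columns(lines):
    """Guess the column count as the most common number of wide gaps, plus one."""
    gap_counts = Counter(
        len(WIDE_GAP.findall(line)) for line in lines if line.strip()
    )
    if not gap_counts:
        return 0
    most_common_gaps, _ = gap_counts.most_common(1)[0]
    return most_common_gaps + 1


def split_row(line):
    """Split one line into its cells at every run of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
```

As with the R version, lines whose cell count differs from the guessed column count would have to be dropped or handled separately.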
