How can I extract text under a specific heading from PDFs imported with the tm package in R?

Asked: 2018-08-30 14:27:58

Tags: r dataset tm stringr

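I am importing several PDFs into R with the tm package. From the content of each PDF I need to extract the character vectors that contain the heading "CORPORATE INFORMATION". The problem is twofold. First, I cannot extract the vectors with this heading. Second, the vector comes out in a very messy form, so I cannot really associate the names of the people with the positions they hold in the company, which is the kind of dataset I am trying to build. An example is shown below. Any help is welcome.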

vector_of_interest <- c("   CORPORATE INFORMATION\r\n   BOARD OF DIRECTORS                 REGISTERED OFFICE\r\n   Chuah Ah Bee                       Suite 12-02,12th Floor\r\n   Executive Chairman                 Menara Zurich\r\n   Chuah Hoon Phong                   170 Jalan Argyll, 10050 Penang\r\n   Group Managing Director            Telephone Number : 04-2296 318\r\n   Chan Kim Keow                      Facsimile Number : 04-2282 118\r\n   Executive Director\r\n   Loo Choo Gee\r\n   Executive Director                 COMPANY SECRETARIES\r\n   Chew Chee Khong\r\n   Executive Director                 Gunn Chit Geok\r\n   Ng Seng Bee                        (MAICSA 0673097)\r\n   Independent Non-Executive Director Chew Siew Cheng\r\n   Haji Ahmad Fazil Bin Haji Hashim   (MAICSA 7019191)\r\n   Independent Non-Executive Director\r\n   Goh Choon Aik\r\n   Independent Non-Executive Director SHARE REGISTRAR\r\n                                      Tricor Investor Services Sdn Bhd\r\n   AUDIT COMMITTEE                    Level 17, The Gardens North Tower\r\n                                      Mid Valley City\r\n   Ng Seng Bee                        Lingkaran Syed Putra\r\n   Chairman                           59200 Kuala Lumpur\r\n   Haji Ahmad Fazil Bin Haji Hashim   Telephone Number : 03-2264 3883\r\n   Member                             Facsimile Number : 03-2282 1886\r\n   Goh Choon Aik\r\n   Member\r\n                                      STOCK EXCHANGE LISTING\r\n   REMUNERATION COMMITTEE             Main Market of Bursa Malaysia Securities Berhad\r\n                                      Stock Code : 7174\r\n   Haji Ahmad Fazil Bin Haji Hashim   Stock Name : CAB\r\n   Chairman\r\n   Chuah Ah Bee\r\n   Member                             AUDITORS\r\n   Ng Seng Bee\r\n   Member                             Deloitte KassimChan\r\n                                      Chartered Accountants\r\n                                      4th Floor, Wisma Wang\r\n   
NOMINATION COMMITTEE               251-A Jalan Burma\r\n                                      10350 Penang\r\n   Haji Ahmad Fazil Bin Haji Hashim\r\n   Chairman\r\n   Ng Seng Bee                        PRINCIPAL BANKERS\r\n   Member\r\n   Goh Choon Aik                      Malayan Banking Berhad\r\n   Member                             Hong Leong Bank Berhad\r\n                                      United Overseas Bank (Malaysia) Berhad\r\n10 CAB Annual Report 2012\r\n")

# My attempt
library(tm)
library(tidyverse)
library(stringr)

# The "-layout" option keeps the original layout as much as possible.
# I have also tried adding engine = "xpdf" before control.
Rpdf <- readPDF(control = list(text = "-layout"))

docs <- Corpus(DirSource(cname), readerControl = list(reader = Rpdf)) # load the documents
document <- content(docs[[1]])
corporate.info <- unlist(str_extract_all(document, "CORPORATE INFORMATION.+"))

The PDF can be found at the following link: the information is on page 10.

1 answer:

Answer 0: (score: 0)

I found a workaround:

First, I changed the default readPDF engine to xpdf:

# The "-layout" option keeps the original layout as much as possible
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))

# Load the documents in cname, the path to the files
docs <- Corpus(DirSource(cname), readerControl = list(reader = Rpdf))

Second, I collapsed the text so that each document is held in a single character string:

document <- content(docs[[1]])
document <- paste(document, collapse = " ")
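
To illustrate the collapsing step without needing a PDF, here is a minimal base R sketch. The `lines` vector is a made-up stand-in for `content(docs[[1]])`, assuming the reader returns a character vector with several elements; the real content will of course differ.

```r
# Hypothetical stand-in for content(docs[[1]]): several text elements,
# with "\f" (form feed) marking a page break.
lines <- c("Page one text",
           "\fCORPORATE INFORMATION",
           "Chuah Ah Bee",
           "\fPage three text")

# Collapse everything into one string so a pattern can span the
# boundaries between the original elements.
document <- paste(lines, collapse = " ")

length(lines)     # number of elements before collapsing
length(document)  # a single string after collapsing
```

Because `paste()` with `collapse` already returns a single string, no `unlist()` is needed afterwards.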

Third, I extracted the page containing the information I wanted and used a regular expression to extract the names:

corporate.info <- unlist(str_extract_all(document, "\\f+.+CORPORATE+.+INFORMATION+.+\\f"))

### "\f" marks the beginning and end of a page
### ".+CORPORATE+.+INFORMATION+.+" matches the page with the heading I was interested in
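
As an alternative way to pick out the page, one could split the collapsed text at the form feeds and keep the pages that contain the heading. This is only a base R sketch on a made-up `document` string, not the approach used above, and the name `corporate.page` is introduced here for illustration.

```r
# Hypothetical collapsed document: three pages separated by form feeds ("\f").
document <- paste0(
  "Cover page\f",
  "CORPORATE INFORMATION BOARD OF DIRECTORS Chuah Ah Bee Executive Chairman\f",
  "Chairman's statement"
)

# Split into pages at each form feed.
pages <- strsplit(document, "\f", fixed = TRUE)[[1]]

# Keep only the pages that contain the heading of interest.
corporate.page <- pages[grepl("CORPORATE\\s+INFORMATION", pages)]
```

Splitting first avoids depending on how the regex engine treats the form feed character, at the cost of handling the pages as a vector.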

corporate.info <- unlist(str_extract_all(corporate.info, "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}")) # extract the names
corporate.info <- unique(corporate.info) # remove duplicates
corporate.info <- str_replace_all(corporate.info, ".*Bank.*", "") # blank out bank names; similar replacements clean up other non-name matches
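
The question also asked how to associate each name with the position held. One possible direction, sketched below in base R, relies on the two-column layout kept by "-layout": cut each line at a fixed width to isolate the left column, drop the all-caps section headings, and pair alternating lines. The sample lines are rebuilt from the first rows of the example in the question, and the column width (3 spaces of indent plus 35 characters) is an assumption read off that example; other reports may need a different width.

```r
# Sample lines rebuilt from the question's example. The fixed widths
# (3 spaces of indent + 35 characters for the left column) are an assumption.
left  <- c("BOARD OF DIRECTORS", "Chuah Ah Bee", "Executive Chairman",
           "Chuah Hoon Phong", "Group Managing Director")
right <- c("REGISTERED OFFICE", "Suite 12-02,12th Floor", "Menara Zurich",
           "170 Jalan Argyll, 10050 Penang", "Telephone Number : 04-2296 318")
page.lines <- sprintf("   %-35s%s", left, right)

# Keep only the left column and trim the padding.
left.col <- trimws(substr(page.lines, 1, 38))

# Drop empty lines and all-caps section headings such as "BOARD OF DIRECTORS".
left.col <- left.col[left.col != "" & left.col != toupper(left.col)]

# Within a section, names and positions alternate:
# odd entries are names, even entries are the position held.
directors <- data.frame(
  name     = left.col[c(TRUE, FALSE)],
  position = left.col[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
```

On a full page this would have to be applied section by section, since the alternation restarts after each heading, and the right-hand column would need its own handling.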