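I am using the tm package to import multiple PDFs into R. From the content of each PDF I need to extract the character vectors that contain the heading "CORPORATE INFORMATION". The problem is twofold. First, I cannot extract the vector using this heading. Second, the vector comes out in a very messy form, so I cannot match people's names to the positions they hold in the company, which is the kind of dataset I am trying to build. An example is shown below. Any help is welcome.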
vector_of_interest <- c(" CORPORATE INFORMATION\r\n BOARD OF DIRECTORS REGISTERED OFFICE\r\n Chuah Ah Bee Suite 12-02,12th Floor\r\n Executive Chairman Menara Zurich\r\n Chuah Hoon Phong 170 Jalan Argyll, 10050 Penang\r\n Group Managing Director Telephone Number : 04-2296 318\r\n Chan Kim Keow Facsimile Number : 04-2282 118\r\n Executive Director\r\n Loo Choo Gee\r\n Executive Director COMPANY SECRETARIES\r\n Chew Chee Khong\r\n Executive Director Gunn Chit Geok\r\n Ng Seng Bee (MAICSA 0673097)\r\n Independent Non-Executive Director Chew Siew Cheng\r\n Haji Ahmad Fazil Bin Haji Hashim (MAICSA 7019191)\r\n Independent Non-Executive Director\r\n Goh Choon Aik\r\n Independent Non-Executive Director SHARE REGISTRAR\r\n Tricor Investor Services Sdn Bhd\r\n AUDIT COMMITTEE Level 17, The Gardens North Tower\r\n Mid Valley City\r\n Ng Seng Bee Lingkaran Syed Putra\r\n Chairman 59200 Kuala Lumpur\r\n Haji Ahmad Fazil Bin Haji Hashim Telephone Number : 03-2264 3883\r\n Member Facsimile Number : 03-2282 1886\r\n Goh Choon Aik\r\n Member\r\n STOCK EXCHANGE LISTING\r\n REMUNERATION COMMITTEE Main Market of Bursa Malaysia Securities Berhad\r\n Stock Code : 7174\r\n Haji Ahmad Fazil Bin Haji Hashim Stock Name : CAB\r\n Chairman\r\n Chuah Ah Bee\r\n Member AUDITORS\r\n Ng Seng Bee\r\n Member Deloitte KassimChan\r\n Chartered Accountants\r\n 4th Floor, Wisma Wang\r\n NOMINATION COMMITTEE 251-A Jalan Burma\r\n 10350 Penang\r\n Haji Ahmad Fazil Bin Haji Hashim\r\n Chairman\r\n Ng Seng Bee PRINCIPAL BANKERS\r\n Member\r\n Goh Choon Aik Malayan Banking Berhad\r\n Member Hong Leong Bank Berhad\r\n United Overseas Bank (Malaysia) Berhad\r\n10 CAB Annual Report 2012\r\n")
#my attempt
library(tm)
library(tidyverse)
library(stringr)
Rpdf <- readPDF(control = list(text = "-layout")) # layout control in order to keep the original format as much as possible. I have also tried to add engine = "xpdf", before control
docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf)) # upload documents
document <- content(docs[[1]])
corporate.info <- unlist(str_extract_all(document, "CORPORATE INFORMATION.+"))
The PDF can be found at the following link: jackson. The information is on page 10.
Answer 0 (score: 0)
I found a workaround.
First, I changed the default readPDF engine to xpdf:
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))
# layout control in order to keep the original format as much as possible
docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf))
# load the documents from cname, the path to the files
Second, I collapsed the text so that each document is a single character string:
document <- content(docs[[1]])
document <- paste(document, collapse = ' ') # paste() already returns a single string, so unlist() is not needed
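Collapsing matters because pattern matching is applied to each element of a character vector separately, so a pattern can never match across two elements. A minimal sketch with a made-up three-line vector (the names `lines`, `per_line`, `collapsed` and `whole_doc` are illustrative and not part of the original code; base R is used so it runs without extra packages):

```r
# Hypothetical stand-in for content(docs[[1]]): one element per line of text
lines <- c("\fCORPORATE INFORMATION", "BOARD OF DIRECTORS", "Chuah Ah Bee")

# Matching line by line: the heading and the name never share an element
per_line <- grepl("CORPORATE INFORMATION.+Chuah", lines)

# Collapse to a single string so one pattern can span the former line breaks
collapsed <- paste(lines, collapse = " ")
whole_doc <- grepl("CORPORATE INFORMATION.+Chuah", collapsed)
```

Here `per_line` is FALSE for every element, while `whole_doc` is TRUE once the lines have been joined.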
Third, I extracted the page containing the information I wanted and used a regular expression to extract the names:
corporate.info <- unlist(str_extract_all(document, "\\f+.+CORPORATE+.+INFORMATION+.+\\f"))
### "\f" --> marks the beginning and the end of a page (form feed)
### "+.+CORPORATE+.+INFORMATION+.+" --> matches the page with the heading I was interested in
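As an alternative to the single regular expression, and not the approach used in this answer, the page could also be located by splitting the collapsed text at the form feeds and keeping the pages that contain the heading. A minimal base R sketch, assuming pages are separated by "\f"; the sample `document` string and the names `pages` and `corporate.page` are made up for illustration:

```r
# Hypothetical collapsed document: three pages separated by form feeds
document <- paste0(
  "Cover page text\f",
  "CORPORATE INFORMATION BOARD OF DIRECTORS Chuah Ah Bee Executive Chairman\f",
  "Chairman's statement"
)

# Split into pages at each form feed
pages <- strsplit(document, "\f", fixed = TRUE)[[1]]

# Keep only the pages that contain the heading
corporate.page <- pages[grepl("CORPORATE\\s+INFORMATION", pages)]
```

Splitting first keeps the heading test simple and avoids relying on how greedy `.+` behaves around page boundaries.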
corporate.info <- unlist(str_extract_all(corporate.info, "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}")) # extract names
corporate.info <- unique(corporate.info) # clean
corporate.info <- str_replace_all(corporate.info, ".*Bank.*", "") # clean: blank out matches that mention a bank; similar rules remove other non-name matches
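To see what the name pattern does and why the cleaning steps are needed, here is a small sketch on made-up text. It uses base R (`regmatches` with `gregexpr`) in place of `str_extract_all` so it runs without extra packages, and it drops the bank entries instead of blanking them, which is a variation on the code above. The sample `text` and the names `name.pattern`, `candidates` and `names.only` are illustrative only:

```r
# Made-up text; runs of spaces mimic the column gaps produced by "-layout"
text <- "Chuah Ah Bee    Executive Chairman    Ng Seng Bee    Member    Chuah Ah Bee    Malayan Banking Berhad"

# Same pattern as above: three capitalised words, each separated by one whitespace character
name.pattern <- "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}"

# Base R counterpart of unlist(str_extract_all(text, name.pattern))
candidates <- unlist(regmatches(text, gregexpr(name.pattern, text, perl = TRUE)))

# Deduplicate, then drop (rather than blank out) entries that mention a bank
candidates <- unique(candidates)
names.only <- candidates[!grepl("Bank", candidates)]
```

Because the pattern expects exactly three words, a longer name such as "Haji Ahmad Fazil Bin Haji Hashim" is returned in pieces, and any three-word company name is picked up as well, so the result still needs manual checking.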