Here is the code to create some test data:
df <- data.frame(page_id = c(3,3,3,3,3), element_id = c(19, 22, 26, 31, 31),
text = c("The Protected Percentage of your property value thats has been chosen is 0%",
"The Arrangement Fee payable at complettion: £50.00",
"Interest rate is fixed for the life of the period is: 5.40%",
"The Benchmark rate that will be used to calculate any early repayment 2.08%",
"The property value used in this scenario is 275,000.00"))
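I have many PDF files from which I want to extract the same information using regular expressions. So far I have managed to extract everything I need from one PDF file. Here is the code for that, with comments: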
library("textreadr")
library("pdftools")
library("tidyverse")
library("tidytext")
library("tm")
# read in the PDF file
Off_let_data <- read_pdf("50045400_K021_2017-V001_300547.pdf")
# list the PDF files in the folder and keep the first file name
files <- list.files(pattern = "pdf$")[1]
# extract the account number from the first pdf file
acc_num <- str_extract(files, "^\\d+")
# The RegEx's used to extract the relevant information
protec_per_reg <- "Protected\\sP\\w+\\sof"
Arr_Fee_reg <- "^The\\sArrangement\\sF\\w+"
Fix_inter_reg <- "Fixed\\sI\\w+\\sR\\w+"
Bench_rate_reg <- "Benchmark\\sR\\w+\\sthat"
# create a df that only includes the rows which match the above RegEx
Off_let <- Off_let_data %>%
  filter(page_id == 3,
         str_detect(text, protec_per_reg) |
           str_detect(text, Arr_Fee_reg) |
           str_detect(text, Fix_inter_reg) |
           str_detect(text, Bench_rate_reg))
# Now only extract the numbers from the above DF
off_let_num <- str_extract(Off_let$text, "\\d+\\.?\\d+")
# The first element is always NA, given the structure of these PDF files,
# so replace it with the percentage matched by the pattern below
off_let_num[is.na(off_let_num)] <- str_extract(Off_let$text, "\\d+%")[[1]]
off_let_num
The off_let_num variable is a vector containing the 4 elements I need from the PDF file.
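To make the number-extraction step reproducible without the PDF, here is a minimal sketch run on the first four test sentences from the top of the post. It uses base R (`regexpr`/`regmatches`) in place of `str_extract` so that it runs without any packages; `first_match` is a hypothetical helper name, and the fallback is applied per element rather than by taking `[[1]]`, which should give the same result for this data.

```r
# First four test sentences from the top of the post
text <- c(
  "The Protected Percentage of your property value thats has been chosen is 0%",
  "The Arrangement Fee payable at complettion: £50.00",
  "Interest rate is fixed for the life of the period is: 5.40%",
  "The Benchmark rate that will be used to calculate any early repayment 2.08%"
)

# Hypothetical helper: first match of `pattern` in each string, NA when absent
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Same pattern as in the post: it needs at least two digits
nums <- first_match(text, "\\d+\\.?\\d+")

# "0%" has a single digit, so it comes back as NA; fall back to the "%" pattern
missing <- is.na(nums)
nums[missing] <- first_match(text[missing], "\\d+%")
nums
```

The single-digit "0%" is what produces the NA in the first position, which is why the second pattern is needed.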
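Now I want to apply all of these steps to a folder containing many PDF files. I have already managed to read all the PDF files into separate data frames, using the code below: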
# list all PDF files in the working directory
file_list <- list.files(pattern = "\\.pdf$")
# Read each PDF file into a separate data frame named "off_<file name>"
for (file_name in file_list) {
  assign(paste0("off_", sub("\\.pdf$", "", file_name)), read_pdf(file_name))
}
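I now have many data frames in my workspace. I want to apply the same process that I applied to one PDF file at the start to all of the data frames whose names start with 'off'.
My idea is to turn the initial process into a function and then call that function on every data frame whose name starts with "off". The results should be appended to a single data frame containing all (4) elements extracted from each of these PDF files. I don't know how to achieve this. Please help!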
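To make the goal concrete, here is a rough sketch of the shape I have in mind, not a working solution for my real files. It assumes the data frames are kept in a named list instead of separate `off_*` objects, uses base R so that it runs without packages, and replaces the PDFs with made-up toy data frames. The names `first_match`, `extract_offer_values`, `make_toy`, `pdf_list` and the output column names are all hypothetical.

```r
# Hypothetical helper: first match of `pattern` in each string, NA when absent
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Hypothetical function wrapping the single-file steps from the post
extract_offer_values <- function(pdf_df, page = 3) {
  patterns <- c(protected_pct   = "Protected\\sP\\w+\\sof",
                arrangement_fee = "^The\\sArrangement\\sF\\w+",
                fixed_rate      = "Fixed\\sI\\w+\\sR\\w+",
                benchmark_rate  = "Benchmark\\sR\\w+\\sthat")
  page_text <- pdf_df$text[pdf_df$page_id == page]
  values <- vapply(patterns, function(p) {
    hit <- page_text[grepl(p, page_text)]
    if (length(hit) == 0) return(NA_character_)   # sentence not found on the page
    value <- first_match(hit[1], "\\d+\\.?\\d+")
    if (is.na(value)) value <- first_match(hit[1], "\\d+%")  # single-digit percentages
    value
  }, character(1))
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}

# Made-up data standing in for the output of read_pdf(); not taken from real files
make_toy <- function(pct, fee, fixed, bench) {
  data.frame(page_id = 3, element_id = 1:4,
             text = c(paste0("The Protected Percentage of your property value is ", pct),
                      paste0("The Arrangement Fee payable at completion: ", fee),
                      paste0("The Fixed Interest Rate for the period is: ", fixed),
                      paste0("The Benchmark Rate that will be used is ", bench)),
             stringsAsFactors = FALSE)
}

# With real files this would be something like:
#   pdf_list <- lapply(file_list, read_pdf); names(pdf_list) <- file_list
pdf_list <- list("50045400_K021.pdf" = make_toy("0%", "50.00", "5.40%", "2.08%"),
                 "50045401_K021.pdf" = make_toy("5%", "75.00", "4.95%", "1.85%"))

# One row per file, with the account number taken from the file name
results <- do.call(rbind, lapply(names(pdf_list), function(nm) {
  data.frame(acc_num = first_match(nm, "^\\d+"),
             extract_offer_values(pdf_list[[nm]]),
             stringsAsFactors = FALSE)
}))
results
```

Keeping the data frames in a list avoids having to look up objects by name with `assign()`/`get()`, and the list names carry the file names needed for the account number.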