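I have a list of more than 90 PDF files that I read and clean in R, and from each file I extract two fields: Number and Date. My current data frame has a single column in which a Number row is followed by the rows holding the Date that corresponds to that Number. I am trying to turn the Date rows into a second column next to their Number. I have had a lot of trouble with this and would appreciate any help. In the example of my current data frame I manually removed part of the string in each row; see the dput output for what the actual data frame looks like.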
This is the code that produces my current data frame:
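library(pdftools)  # provides pdf_text()
library(magrittr)  # provides %>%

# pt is assumed to be a character vector of PDF file paths (not shown here)
# Return the text of one PDF, one string per page
PDFreader <- function(x) {
  pdf_text(x)
}
op2 <- lapply(pt, PDFreader)
# Split each page into individual lines
op2.1 <- sapply(op2, strsplit, split = "\n")
# Keep only the lines that contain a Number or Date field
op3 <- rapply(op2.1, grep, pattern = "Number:|Date:",
              value = TRUE) %>%
  unique()
df_all <- as.data.frame(op3, stringsAsFactors = FALSE) %>%
  unique()
df_all$op3 <- as.character(df_all$op3)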
dput(head(df_all))
structure(list(op3 = c("Number: 11", "Date: 01/03/2018 Last Revised Review: AM #17",
"Date: 01/03/2018 Last Revised Review: AM #17",
"Date: 01/03/2018 Last Revised Review: AM #17",
"Date: 01/03/2018 Last Revised Review: AM #17",
" Date: 09/10/2018 Last Revised Review: AM# 39"
)), .Names = "op3", row.names = c(NA, 6L), class = "data.frame")
Example of my current data frame:
op3 --> COLUMN NAME
Number: 11
Date: 01/03/2018 .. some text
Date: 01/03/2018.. some text
Date: 01/03/2018 .. some text
Date: 01/03/2018 .. some text
Date: 09/10/2018 .. some text
Number: 12
Date: 12/06/2016 .. some text
Date: 12/06/2016 .. some text
Date: 12/06/2016 .. some text
Number: 13
Date: 10/29/2018 .. some text
Date: 10/29/2018 .. some text
Date: 10/29/2018 .. some text
Date: 10/29/2018.. some text
Desired data frame:
op3 op4
Number:11 Date:01/03/2018
Number:12 Date:12/06/2016
Number:13 Date:10/29/2018
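
For reference, one possible way to get from the current layout to the desired one is sketched below in base R. It is only a sketch under assumptions: the toy `df_all` is rebuilt from the example rows above, the first `Date:` row after each `Number:` row is taken to be the one wanted (the desired output pairs Number 11 with 01/03/2018, not 09/10/2018), and dates are assumed to be in mm/dd/yyyy form. The helper `extract_date` and the variables `group`, `first_date` and `result` are illustrative names, not part of the original code.

```r
# Toy data mimicking df_all (values copied from the example rows above)
df_all <- data.frame(
  op3 = c("Number: 11",
          "Date: 01/03/2018 Last Revised Review: AM #17",
          " Date: 09/10/2018 Last Revised Review: AM# 39",
          "Number: 12",
          "Date: 12/06/2016 .. some text",
          "Number: 13",
          "Date: 10/29/2018 .. some text"),
  stringsAsFactors = FALSE
)

x <- trimws(df_all$op3)
is_number <- grepl("^Number:", x)

# Each "Number:" row starts a new group; the rows after it share its group id
group <- cumsum(is_number)

# Hypothetical helper: pull the mm/dd/yyyy part out of a "Date:" line
extract_date <- function(s) {
  m <- regmatches(s, regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", s))
  paste0("Date:", m)
}

# Assumption: the first "Date:" row within each group is the one to keep
date_rows <- !is_number & grepl("^Date:", x)
first_date <- tapply(x[date_rows], group[date_rows],
                     function(v) extract_date(v[1]))

# One row per Number; groups without any Date row would get NA in op4
keys <- as.character(seq_len(sum(is_number)))
result <- data.frame(
  op3 = gsub("\\s+", "", x[is_number]),
  op4 = as.vector(first_date[keys]),
  stringsAsFactors = FALSE
)
print(result)
```

The grouping relies on `cumsum()` over the logical `is_number` vector, so it depends on the rows staying in the order they were read from the PDFs. If a different date per Number is wanted (for example the latest one), only the function passed to `tapply()` would need to change.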