I have PDF files created from these Wikipedia pages (for example):
https://en.wikipedia.org/wiki/AIM-120_AMRAAM
https://en.wikipedia.org/wiki/AIM-9_Sidewinder
I have a list of keywords that I want to search for in the documents, extracting each sentence in which a keyword appears.
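I can read in a file, extract the text from the PDF, and pull out the sentences that contain a keyword. This works when I do it for each keyword separately, but when I try to do it in a loop, the rows are not appended. Instead, it behaves almost like a column bind and then throws an error about the number of columns. Here is my code; any help with what I can do to make this work is much appreciated.
How can I get the rows to append correctly and show up for each PDF file?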
keywords <- c("altitude", "range", "speed")
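Once the rows are appending correctly, the next step is to add the keyword as a variable to the left of each extracted sentence. Ideal output example: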
pdf.files <- list.files(path = "/path/to/file", pattern = "*.pdf", full.names = FALSE, recursive = FALSE)
for (i in 1:length(pdf.files)) {
for (j in 1:length(keywords)) {
text <- pdf_text(file.path("path", "to", "file", pdf.files[i]))
text2 <- tolower(text)
text3 <- gsub("\r", "", text2)
text4 <- gsub("\n", "", text3)
text5 <- grep(keywords[j], unlist(strsplit(text4, "\\.\\s+")), value = TRUE)
}
temp <- rbind(text5)
assign(pdf.files[i], temp)
}
Would this be done inside the loop, or afterwards as a separate function?
Thanks for your help.
Answer 0 (score: 0)
OK, this took some real thought, but I got it working. It isn't pretty, but it gets the job done:
library(pdftools)  # provides pdf_text()

keywords <- c("altitude", "range", "speed")
pdf.dir <- "/path/to/file"

# This first part initializes the files to be written to (header row only)
pdf.files <- list.files(path = pdf.dir, pattern = "\\.pdf$", full.names = FALSE, recursive = FALSE)
for (h in seq_along(pdf.files)) {
  temp1 <- data.frame(matrix(ncol = 2, nrow = 0))
  colnames(temp1) <- c("Title", "x")
  out.file <- file.path(pdf.dir, paste0(tools::file_path_sans_ext(pdf.files[h]), ".txt"))
  write.table(temp1, out.file, sep = "\t", row.names = FALSE, quote = FALSE)
}
# This next part fills in the files with the sentences
for (i in seq_along(pdf.files)) {
  # Extract and clean the text once per PDF, not once per keyword
  text <- pdf_text(file.path(pdf.dir, pdf.files[i]))
  text2 <- tolower(text)
  text3 <- gsub("\r", "", text2)
  # Replace newlines with a space so words on adjacent lines are not glued together
  text4 <- gsub("\n", " ", text3)
  sentences <- unlist(strsplit(text4, "\\.\\s+"))
  out.file <- file.path(pdf.dir, paste0(tools::file_path_sans_ext(pdf.files[i]), ".txt"))
  for (j in seq_along(keywords)) {
    # Unique sentences containing the current keyword
    hits <- unique(grep(keywords[j], sentences, value = TRUE))
    if (length(hits) > 0) {
      # Keyword in the left column, matching sentence in the right column
      temp <- data.frame(Title = keywords[j], x = hits)
      write.table(temp, out.file, sep = "\t", row.names = FALSE, quote = FALSE, col.names = FALSE, append = TRUE)
    }
  }
}
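As a possible alternative to appending to a file on every iteration, the matches can be collected in memory and bound together once at the end. The following is only a minimal sketch under some assumptions: `extract_keyword_sentences` is a hypothetical helper name, the `pages` vector stands in for what `pdf_text()` would return (one string per page), and `fixed = TRUE` is used so that keywords are matched literally rather than as regular expressions.

```r
# Hypothetical helper: returns one row per (keyword, sentence) match.
extract_keyword_sentences <- function(text, keywords) {
  # Normalise: lower-case, drop carriage returns, turn newlines into spaces
  clean <- gsub("\n", " ", gsub("\r", "", tolower(text)))
  sentences <- unlist(strsplit(clean, "\\.\\s+"))
  # Build one small data frame per keyword, then bind them all at once
  per_keyword <- lapply(keywords, function(kw) {
    hits <- unique(grep(kw, sentences, value = TRUE, fixed = TRUE))
    data.frame(Title = rep(kw, length(hits)), x = hits, stringsAsFactors = FALSE)
  })
  do.call(rbind, per_keyword)
}

# Stand-in for the character vector pdf_text() would return (assumed sample text)
pages <- c(
  "The missile has a long range. It flies at high altitude.\nTop speed is classified.",
  "Range depends on launch altitude. Nothing else here."
)
keywords <- c("altitude", "range", "speed")

result <- extract_keyword_sentences(pages, keywords)
print(result)
```

Because every per-keyword data frame has the same two columns, `rbind` stacks them as rows even when a keyword has no matches. For real PDFs, the same function could be called on `pdf_text(...)` inside a loop over the files, with one `write.table()` call per file.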