Question

I have PDF files that I created from Wikipedia pages such as these:

https://en.wikipedia.org/wiki/AIM-120_AMRAAM

https://en.wikipedia.org/wiki/AIM-9_Sidewinder

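I have a list of keywords that I want to search for in the documents, extracting the sentences in which they appear.

keywords <- c("altitude", "range", "speed")

I can open a file, extract the text from the PDF, and pull out the sentences that contain a keyword. This works if I do it for each keyword separately, but when I try to do it in a loop, the rows are not appended. Instead, it behaves almost like a cbind, and then an error is thrown about the number of columns. Here is my code; any help with making this work is much appreciated.

pdf.files <- list.files(path = "/path/to/file", pattern = "*.pdf", full.names = FALSE, recursive = FALSE)
for (i in 1:length(pdf.files)) {
    for (j in 1:length(keywords)) {
        text <- pdf_text(file.path("path", "to", "file", pdf.files[i]))
        text2 <- tolower(text)
        text3 <- gsub("\r", "", text2)
        text4 <- gsub("\n", "", text3)
        text5 <- grep(keywords[j], unlist(strsplit(text4, "\\.\\s+")), value = TRUE)
    }
    temp <- rbind(text5)
    assign(pdf.files[i], temp)
}

How do I get the rows to append correctly and appear in a file for each PDF?

Once the rows append correctly, the next step will be to add the keyword as a variable to the left of each extracted sentence, so that every output row pairs a keyword with a sentence.

Would this be done inside the loop or afterwards as a separate function?

Thanks for your help.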

Answer 1

Well, this took some real thinking, but I got it to work. It isn't pretty, but it gets the job done:

library(pdftools)  # provides pdf_text()

# keywords is the character vector defined in the question
pdf.dir <- "/path/to/file"
pdf.files <- list.files(path = pdf.dir, pattern = "\\.pdf$", full.names = FALSE, recursive = FALSE)

# This first part initializes the files to be written to (header row only)
for (h in seq_along(pdf.files)) {
    temp1 <- data.frame(matrix(ncol = 2, nrow = 0))
    colnames(temp1) <- c("Title", "x")
    out.file <- file.path(pdf.dir, paste0(tools::file_path_sans_ext(pdf.files[h]), ".txt"))
    write.table(temp1, out.file, sep = "\t", row.names = FALSE, quote = FALSE)
}

# This next part fills in the files with the sentences
for (i in seq_along(pdf.files)) {
    # Read and normalize the text once per PDF rather than once per keyword
    text <- pdf_text(file.path(pdf.dir, pdf.files[i]))
    text2 <- tolower(text)
    text3 <- gsub("\r", "", text2)
    text4 <- gsub("\n", "", text3)
    sentences <- unlist(strsplit(text4, "\\.\\s+"))
    out.file <- file.path(pdf.dir, paste0(tools::file_path_sans_ext(pdf.files[i]), ".txt"))
    for (j in seq_along(keywords)) {
        text5 <- as.data.frame(grep(keywords[j], sentences, value = TRUE))
        colnames(text5) <- "x"
        if (nrow(text5) != 0) {
            # One-row keyword column; cbind recycles it across all matching sentences
            title <- as.data.frame(keywords[j])
            colnames(title) <- "Title"
            temp <- cbind(title, text5)
            temp <- unique(temp)
            # Append below the header written in the first part
            write.table(temp, out.file, sep = "\t", row.names = FALSE, quote = FALSE, col.names = FALSE, append = TRUE)
        }
    }
}
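
As a possible alternative to appending to files on disk, the keyword/sentence rows can also be collected in memory and bound together once at the end. The following is only a minimal sketch, not the method used above: the helper name extract_keyword_sentences and the sample text are made up for illustration, and the sample text stands in for what pdf_text() would return so that the example runs without any PDF.

```r
# Hypothetical helper (not part of the answer above): pair each keyword with
# the sentences that contain it. `pages` is a character vector of page text,
# such as pdf_text() returns.
extract_keyword_sentences <- function(pages, keywords) {
    # Same normalization as in the answer: lower-case and strip line breaks
    text <- gsub("[\r\n]", "", tolower(pages))
    sentences <- unlist(strsplit(text, "\\.\\s+"))
    # Build one data frame per keyword and keep them in a list
    rows <- lapply(keywords, function(kw) {
        hits <- unique(grep(kw, sentences, value = TRUE))
        data.frame(Title = rep(kw, length(hits)), x = hits, stringsAsFactors = FALSE)
    })
    # Bind all per-keyword frames in one step; empty frames contribute no rows
    do.call(rbind, rows)
}

# Made-up sample text standing in for the output of pdf_text()
pages <- c("The missile has a long range. It flies at high speed. ",
           "Its speed exceeds Mach 2. Guidance is active radar.")
result <- extract_keyword_sentences(pages, c("altitude", "range", "speed"))
print(result)
```

With real files, the same function could be called as extract_keyword_sentences(pdf_text(path), keywords) for each PDF, and the returned data frame written out with a single write.table() call, which avoids initializing the output files beforehand.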

R: Appending multiple rows to a data frame in a for loop
