Question

I want to search a collection of PDFs so that I can quickly find tables and figures relevant to my research.

#I load the following libraries
library(pdfsearch)
library(tm)
library(pdftools)

#I assign the directory of my PDF files to the path where they are located
directory <- '/References'

#and then I search the directory for the keywords "table", "graph", and "chart"
txt <- keyword_directory(directory,
 keyword = c('table', 'graph', 'chart'),
 split_pdf = TRUE,
 remove_hyphen = TRUE,
 full_names = TRUE)

#Up to this point everything works fine. I get a nice data.frame called "txt" 
#with 1356 objects in 7 columns. However, when I try to search the data.frame 
#I start running into trouble.

#I start with "hunter" a term that I know resides in the token_text column 
txt[which(txt$token_text == 'hunter'), ]

#executing this code produces the following message
[1] ID pdf_name keyword page_num line_num line_text token_text
<0 rows> (or 0-length row.names)

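Am I using the right tool to search my data.frame? Is there an easier way to cross-reference this data? Is there a package out there designed to help someone sift through a large number of PDFs? Thank you for your time.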

Answer 1

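The == comparison returns TRUE or FALSE for each value it is given (for example, every value in a data frame column), depending on whether the condition is met, and which converts that logical vector into the positions that are TRUE. You can subset a data frame by passing either those positions or the TRUE/FALSE values themselves for the rows you want to keep or drop.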

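Putting these together gives:
txt[which(txt$token_text == 'hunter'), ]
which is what you ran, and it returned no rows. As pointed out in the comments, == tests for an exact match, and you probably do not have one.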
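The exact-match behaviour can be reproduced on a small stand-in. This is only a sketch: txt_demo and its values are made up for illustration and are not the asker's data, and it assumes token_text is a plain character column.

```r
# Hypothetical stand-in for txt; the values are invented for illustration
txt_demo <- data.frame(
  ID = 1:3,
  token_text = c("The hunter returned", "Hunter-gatherer groups", "Table 2 shows yields"),
  stringsAsFactors = FALSE
)

# == compares each whole string against "hunter", so a string that merely
# contains the word does not count as a match
is_exact <- txt_demo$token_text == "hunter"

# which() converts the logical vector into the positions that are TRUE
exact_positions <- which(is_exact)

# No position matched, so the subset has zero rows, as in the question
exact_rows <- txt_demo[exact_positions, ]
print(exact_rows)
```

If token_text holds whole lines or phrases rather than single words, an exact comparison like this would be expected to come back empty.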

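To get TRUE/FALSE values from a partial match or a regular expression, you can use the grepl function:
txt[grepl("hunter", txt$token_text, ignore.case=TRUE), ]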
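A minimal sketch of the partial-match approach on toy data follows. The data frame txt_demo is hypothetical and only mimics the token_text column, which is assumed to be a character vector.

```r
# Hypothetical stand-in for txt; the values are invented for illustration
txt_demo <- data.frame(
  ID = 1:3,
  token_text = c("The hunter returned", "Hunter-gatherer groups", "Table 2 shows yields"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE wherever the pattern occurs anywhere in the string;
# ignore.case = TRUE lets "Hunter" match as well
has_hunter <- grepl("hunter", txt_demo$token_text, ignore.case = TRUE)
partial_rows <- txt_demo[has_hunter, ]

# The pattern is a regular expression, so alternatives can be combined with |
has_keyword <- grepl("table|graph|chart", txt_demo$token_text, ignore.case = TRUE)
keyword_rows <- txt_demo[has_keyword, ]

print(partial_rows)
print(keyword_rows)
```

Because grepl returns a logical vector of the same length as the column, it can be used directly inside the square brackets without which.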

For readability, I prefer the dplyr package.
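One way a dplyr version might look is sketched here. Pairing filter() with grepl() is an assumption about what such a version could be, the toy data frame txt_demo is hypothetical, and the dplyr package must be installed.

```r
library(dplyr)

# Hypothetical stand-in for txt; the values are invented for illustration
txt_demo <- data.frame(
  ID = 1:3,
  token_text = c("The hunter returned", "Hunter-gatherer groups", "Table 2 shows yields"),
  stringsAsFactors = FALSE
)

# filter() keeps the rows for which the condition is TRUE; column names
# can be used directly, without the txt_demo$ prefix
hunter_rows <- txt_demo %>%
  filter(grepl("hunter", token_text, ignore.case = TRUE))

print(hunter_rows)
```

The result is the same as the base R subset with grepl; the pipe simply reads top to bottom, which some people find easier to follow.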

How do I search a data frame using the which function?
