Question

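I am trying to read PDF files with the tm package. Most of my attempts succeeded, but one did not. I have several folders, each containing hundreds of files, and I was able to read all of them except one folder. The problem is that the PDFs in that particular folder have a series of images at the bottom of the first page, which prevents me from reading them. I get the following error: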

int count = 2;
for (count--; count > 0; count--) // count is decremented to 1, then tested against > 0
{} // some code is executed, then count is decremented by 1 and fails the > 0 test
// loop terminates

If I delete the first page, I can read them. I could do that without losing any relevant information, but it is too much work.

I tried using xpdf and ghostscript, but both gave me the same error.

My code is as follows:

Error in strptime(d, fmt) : input string is too long

Does anyone have a suggestion? I am on a Mac.

Answer 1

Latest answer:

I recently found that in the current version of tm (0.7-4), readPDF uses pdftools as its default engine for reading PDFs.

library(tm)

directory <- getwd() # change this to the directory where the PDF files are located

# Read the PDFs with readPDF; the default engine is pdftools (see ?readPDF for more info).
# The pattern is a regular expression, so escape the dot and anchor it to the end of the name.
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))
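
If readPDF still fails on some files, one possible workaround is to extract the text directly with pdftools::pdf_text and build the corpus from the resulting strings. The following is only a minimal sketch, not something the answer itself proposes. The helper read_pdf_text and its skip_pages argument are hypothetical names introduced here, and dropping the first page is an assumption taken from the question's remark that removing it loses no relevant information. Whether this avoids the strptime error depends on where that error actually originates, which the thread does not establish.

```r
library(tm)
library(pdftools)

# Hypothetical helper (not part of tm or pdftools): returns the text of one PDF
# as a single string, optionally dropping the first `skip_pages` pages.
# If the file cannot be read, it warns and returns an empty string.
read_pdf_text <- function(path, skip_pages = 0) {
  pages <- tryCatch(
    pdftools::pdf_text(path),  # character vector, one element per page
    error = function(e) {
      warning("Could not read ", path, ": ", conditionMessage(e))
      character(0)
    }
  )
  # Only drop pages when the document has more pages than we want to skip
  if (skip_pages > 0 && length(pages) > skip_pages) {
    pages <- pages[-seq_len(skip_pages)]
  }
  paste(pages, collapse = "\n")
}

directory <- getwd() # assumption: change this to the folder holding the PDFs
files <- list.files(directory, pattern = "\\.pdf$",
                    full.names = TRUE, ignore.case = TRUE)

# One string per file, with the first page skipped
texts <- vapply(files, read_pdf_text, character(1), skip_pages = 1)
my_corpus <- VCorpus(VectorSource(texts))
```

Because each file is read inside tryCatch, a single problematic PDF produces a warning and an empty document instead of aborting the whole folder.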

Reading PDFs with the tm package