Question

I have a text file that looks like this:

These are the hig hlights. Transit ioning to this, hello. I have 
provided this informat ion. The man has this dis eas e. He needs to take this dos age of medicine. Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose.

Many of the words are broken apart by stray spaces. Is there a way to detect these errors in word structure and fix them in R?

So basically, the result should be:

These are the highlights. Transitioning to this, hello. I have 
provided this information. The man has this disease. He needs to take this dosage of medicine. For one month, through this program, do this. Do not overdose.
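
One possible way to get from the broken text to the text above is a dictionary-based merge, shown here as a minimal sketch rather than a definitive solution. The helper `fix_split_words()` and its `max_parts` argument are hypothetical names, and the tiny `vocab` vector stands in for a real word list, which you would need to supply. Adjacent fragments are joined only when the joined string is a known word and at least one fragment is not a word on its own, so pairs such as "do this" are left alone.

```r
# Hypothetical helper: rejoin word fragments using a known vocabulary.
# vocab is assumed to be a character vector of valid words.
fix_split_words <- function(text, vocab, max_parts = 4) {
  vocab <- tolower(vocab)
  # Lookup key: letters only, lower case (drops trailing punctuation)
  key <- function(x) tolower(gsub("[^[:alpha:]]", "", x))
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  out <- character(0)
  n <- length(tokens)
  i <- 1
  while (i <= n) {
    merged <- FALSE
    last <- min(n, i + max_parts - 1)
    if (last > i) {
      # Try the longest run of fragments first
      for (j in last:(i + 1)) {
        parts <- tokens[i:j]
        # Only the final fragment may carry punctuation
        inner_ok <- !any(grepl("[^[:alpha:]]", parts[-length(parts)]))
        joined <- paste0(parts, collapse = "")
        if (inner_ok && key(joined) %in% vocab &&
            !all(key(parts) %in% vocab)) {
          out <- c(out, joined)
          i <- j + 1
          merged <- TRUE
          break
        }
      }
    }
    if (!merged) {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  paste(out, collapse = " ")
}

# Toy vocabulary for illustration only; a real word list is assumed in practice
vocab <- c("for", "one", "month", "through", "this", "program",
           "do", "not", "overdose")
broken <- "Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose."
result <- fix_split_words(broken, vocab)
print(result)
```

This sketch collapses all whitespace, including line breaks, into single spaces, so it would need to be applied line by line to keep the original layout. It also cannot repair a word whose fragments are all valid words themselves (for example "dos age" if "dos" is in the word list), and the quality of the result depends entirely on the vocabulary used.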

I extracted the text from a PDF with the following code:

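library(tm)

file <- 'C:/Project/Section/SubSection/text.pdf'
# PDF reader configured to request layout-preserving text extraction
Rpdf <- readPDF(control = list(text = "-layout"))

corpus <- VCorpus(URISource(file), readerControl = list(reader = Rpdf))
# Character vector of text lines from the first (and only) document
corpus.array <- content(content(corpus)[[1]])

# write() returns NULL, so its result is not assigned
write(corpus.array, "C:/Project/Section/SubSection/text1.txt")
readtext <- readLines("C:/Project/Section/SubSection/text1.txt")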

This produced the awkwardly spaced text shown above. Is there a better way to convert a PDF to a text file?

Detecting and removing erroneous spaces between letters within words
