Detect and remove erroneous spaces between letters within words

Time: 2018-07-24 17:22:00

Tags: r pdf tm

I have a text file that looks like this:

These are the hig hlights. Transit ioning to this, hello. I have 
provided this informat ion. The man has this dis eas e. He needs to take this dos age of medicine. Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose.

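Many of the words are broken apart by stray spaces. Is there a way to detect these errors in word structure and fix them with R?

So basically, the desired output is: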

These are the highlights. Transitioning to this, hello. I have 
provided this information. The man has this disease. He needs to take this dosage of medicine. For one month, through this program, do this. Do not overdose.
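
For illustration, one possible direction is a dictionary-based merge: join adjacent fragments whenever the joined form is a known word. The sketch below is only a rough idea under stated assumptions. `fix_split_words`, `words`, and `max_parts` are hypothetical names, and the tiny word list stands in for a real dictionary. With a full dictionary, false merges (for example "a part" becoming "apart") are possible, so the result would still need checking.

```r
# Hypothetical helper: merge adjacent fragments when the joined form is a known word.
# `dictionary` is assumed to be a character vector of valid words.
fix_split_words <- function(text, dictionary, max_parts = 4) {
  tokens <- strsplit(text, " +")[[1]]
  dictionary <- tolower(dictionary)
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    merged <- FALSE
    # Try the longest run of fragments first, down to a pair.
    upper <- min(max_parts, length(tokens) - i + 1)
    if (upper >= 2) {
      for (n in upper:2) {
        candidate <- paste(tokens[i:(i + n - 1)], collapse = "")
        # Ignore trailing punctuation and case when looking the word up.
        key <- tolower(sub("[[:punct:]]+$", "", candidate))
        if (key %in% dictionary) {
          out <- c(out, candidate)
          i <- i + n
          merged <- TRUE
          break
        }
      }
    }
    if (!merged) {
      # No merge found: keep the token unchanged.
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  paste(out, collapse = " ")
}

# Toy word list (an assumption for this example, not a real dictionary).
words <- c("highlights", "transitioning", "information", "disease", "dosage",
           "for", "one", "month", "through", "program", "not")
broken <- "Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose."
fixed <- fix_split_words(broken, words)
print(fixed)
```

Because the longest run is tried first, "no t" is merged into "not" even when "no" is itself a valid word. The function works on one string at a time, so multi-line input would need to be processed line by line, for example with `vapply`.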

I extracted the text from the PDF with the following code:

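library(tm)  # provides readPDF, VCorpus, URISource and content

file <- 'C:/Project/Section/SubSection/text.pdf'
# PDF reader that passes the "-layout" option to the text extractor
Rpdf <- readPDF(control = list(text = "-layout"))

corpus <- VCorpus(URISource(file), readerControl = list(reader = Rpdf))
# Text of the first (and only) document in the corpus
corpus.array <- content(content(corpus)[[1]])

# write() returns NULL, so its result is not assigned
write(corpus.array, "C:/Project/Section/SubSection/text1.txt")
readtext <- readLines("C:/Project/Section/SubSection/text1.txt")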

This produces text with awkward spacing. Is there a better way to convert a PDF to a text file?
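
As a point of comparison, the `pdftools` package offers `pdf_text()`, which returns one string per page. Whether it avoids the stray spaces depends on how the PDF encodes its text, so the sketch below is only something to try, not a guaranteed fix. `pdf_to_text` is a hypothetical wrapper name, and `pdftools` is assumed to be installed.

```r
# Hypothetical helper: extract text with pdftools instead of tm's readPDF.
# Assumes the pdftools package is installed.
pdf_to_text <- function(pdf_path, txt_path) {
  pages <- pdftools::pdf_text(pdf_path)  # character vector, one element per page
  writeLines(pages, txt_path)            # write the pages to a plain-text file
  invisible(pages)
}

# Usage with the paths from the question:
# pdf_to_text("C:/Project/Section/SubSection/text.pdf",
#             "C:/Project/Section/SubSection/text1.txt")
```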

0 answers:

No answers