I have a text file that looks like this:
These are the hig hlights. Transit ioning to this, hello. I have
provided this informat ion. The man has this dis eas e. He needs to take this dos age of medicine. Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose.
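Many of the words have been split into fragments. Is there a way to detect these errors in word structure and fix them with R?
So basically, the result should be: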
These are the highlights. Transitioning to this, hello. I have
provided this information. The man has this disease. He needs to take this dosage of medicine. For one month, through this program, do this. Do not overdose.
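To make the kind of repair I mean concrete, here is a minimal sketch of one possible approach: greedily merging adjacent fragments whenever the joined string is a known word. The function `rejoin_fragments`, its `max_parts` parameter, and the tiny `dictionary` vector are all hypothetical and only for illustration; they do not come from any package, and a real word list would be needed in practice.

```r
# Hypothetical helper (not from any package): merge adjacent fragments when
# the joined string, lowercased and stripped of edge punctuation, is a word
# in `dictionary`. Longer merges are tried before shorter ones.
rejoin_fragments <- function(text, dictionary, max_parts = 4) {
  tokens <- strsplit(text, " +")[[1]]
  normalise <- function(x) tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", x))
  out <- character(0)
  n <- length(tokens)
  i <- 1
  while (i <= n) {
    # Start with the longest candidate that fits and shrink down to 2 parts
    k <- min(max_parts, n - i + 1)
    while (k >= 2) {
      candidate <- paste(tokens[i:(i + k - 1)], collapse = "")
      if (normalise(candidate) %in% dictionary) break
      k <- k - 1
    }
    if (k >= 2) {
      # A merge of k fragments matched a dictionary word
      out <- c(out, candidate)
      i <- i + k
    } else {
      # No merge matched, keep the token unchanged
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  paste(out, collapse = " ")
}

# Assumed toy dictionary covering only the broken words in this sample
dictionary <- c("for", "one", "month", "through", "program", "not")
broken <- "Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose."
fixed <- rejoin_fragments(broken, dictionary)
print(fixed)
```

With a full dictionary this would likely produce false merges (for example, joining "a" and "round" into "around"), so it is only a rough illustration of the idea, not a robust fix.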
I extracted the text from the PDF with the following code:
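library(tm)  # provides readPDF(), VCorpus(), URISource(), content()

file <- 'C:/Project/Section/SubSection/text.pdf'
# Reader that requests layout-preserving extraction via the -layout option
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file), readerControl = list(reader = Rpdf))
# Pull the character vector of lines out of the first document
corpus.array <- content(content(corpus)[[1]])
# Write the lines to a text file, then read them back in
write(corpus.array, "C:/Project/Section/SubSection/text1.txt")
readtext <- readLines("C:/Project/Section/SubSection/text1.txt")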
This produced the awkwardly spaced text. Is there a better way to convert a PDF to a text file?