Below is code written by @Ken S for extracting data from OCR'd PDFs. It produces a data frame like this:

name       Status       Page               Words
test.pdf   Present      test_1, test_3     gym, school
test1.pdf  Present      test1_4, test1_7   gym, swimming pool
test2.pdf  Not Present  -                  -

However, I would like the data flattened, with one row per word found, so that the output looks like this:

fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym and school regularly
test.pdf   Present      test_1   school         I go gym and school regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -
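
fileName = the name of the file
Status = "Present" if any of the words is found, otherwise "Not Present"
Page = the suffixes "_1", "_3" give the page number on which the word was found; for example, "gym" was found on page "test_1" and "school" on page "test_3".
Words = all the words that were found; for example, only "gym" and "school" were found in test.pdf (on pages 1 and 3), and only "swimming pool" and "gym" in test1.pdf (on pages 4 and 7).
TEXT = the text in which the word was found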
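
To make the target shape concrete, here is a minimal sketch that builds it from made-up in-memory page texts instead of real OCR output. The `texts` list, its contents, and the helper `flatten_file()` are hypothetical names introduced only for illustration; the real pipeline would supply one character vector per PDF with one element per page.

```r
# Hypothetical stand-in for the OCR output: one character vector per file,
# one element per page (already lower-cased).
texts <- list(
  test  = c("i go gym and school regularly", "nothing here", "here is the next school"),
  test2 = c("no match", "still no match")
)
strings <- c("school", "gym", "swimming pool")

# Illustrative helper: one row per (page, word) hit, or a single
# placeholder row when the file contains none of the words.
flatten_file <- function(name, pages) {
  rows <- lapply(strings, function(w) {
    idx <- grep(w, pages, fixed = TRUE)  # page numbers containing the word
    if (length(idx) == 0) return(NULL)
    data.frame(fileName = paste0(name, ".pdf"), Status = "Present",
               Page = paste0(name, "_", idx), Words = w, TEXT = pages[idx],
               stringsAsFactors = FALSE)
  })
  rows <- Filter(Negate(is.null), rows)
  if (length(rows) == 0) {
    return(data.frame(fileName = paste0(name, ".pdf"), Status = "Not Present",
                      Page = "-", Words = "-", TEXT = "-",
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

flat <- do.call(rbind, unname(Map(flatten_file, names(texts), texts)))
rownames(flat) <- NULL
print(flat)
```

The rows come out grouped by word rather than by page; sorting on `Page` afterwards would give the ordering shown in the table above.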

Here is the current code:

library(pdftools)   # pdf_convert()
library(tesseract)  # ocr()

all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from pdfs: render each page to TIFF, OCR it, lower-case the result
texts <- lapply(all_files, function(x){
  img_file <- pdf_convert(x, format="tiff", dpi=400)
  return( tolower(ocr(img_file)) )
})

# Check for presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    # Indices of the pages of document d that contain strings[w]
    intermed <- grep(strings[w], texts[[d]])
    words[[d]] <- c(words[[d]],
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
Page <- Map(paste0, fileName, "_", pages, collapse=", ")
# Files with no hits get "-" (testing for a comma would also blank out single-page hits)
Page[lengths(pages) == 0] <- "-"
Page <- t(data.frame(Page))
Words <- sapply(words, paste0, collapse=", ")
#Words <- unlist(words, recursive = T)
Status <- ifelse(sapply(Words, nchar) > 0, "Present", "Not present")
data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)

I tried changing it to use Words <- unlist(words, recursive = T) instead, but that gives the error:

Error in data.frame(row.names = fileName, Status = Status, Page = Page, :
  row names supplied are of the wrong length
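
As far as I can tell, the mismatch can be reproduced without any OCR. The sketch below uses made-up values for `fileName` and `words` (assumptions, not the real data): collapsing gives one entry per file, while `unlist()` gives one entry per word found, so the row names no longer line up with the column.

```r
# Made-up values standing in for the real results
fileName <- c("test", "test1", "test2")
words <- list(c("gym", "school"), c("gym", "swimming pool"), NULL)

per_file  <- sapply(words, paste0, collapse = ", ")  # one string per file
flattened <- unlist(words)                           # one element per word found

length(per_file)   # matches length(fileName)
length(flattened)  # longer than length(fileName)

# Three row names against a four-element column: data.frame() refuses
res <- try(data.frame(row.names = fileName, Words = flattened), silent = TRUE)
inherits(res, "try-error")
```

So a flattened `Words` would presumably need `fileName`, `Status` and `Page` repeated to the same length before the data frame is built.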

Any suggestions on what should be changed?

Thanks.

P.S.: Access the sample files