Question

Below is code written by @Ken S for extracting data from OCR'd PDFs. It produces a data frame like this:

    name       Status        Page               Words
    test.pdf   Present       test_1, test_3     gym, school
    test1.pdf  Present       test1_4, test1_7   gym, swimming pool
    test2.pdf  Not Present   -                  -

However, I want the data flattened so that the output looks like this:

fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym and school regularly
test.pdf   Present      test_1   school         I go gym and school regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -

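fileName = the name of the file

Status = "Present" if any of the words is found, otherwise "Not Present"

Page = the suffixes "_1" and "_3" give the page number on which a word was found; for example, "gym" was found on page "test_1" and "school" on page "test_3".

Words = all the words that were found; in test.pdf only "gym" and "school" were found (on pages 1 and 3), and in test1.pdf only "swimming pool" and "gym" were found (on pages 4 and 7).

TEXT = the text in which the word was found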
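
The column definitions above amount to one row per (file, page, word) match. A minimal sketch of that layout follows, assuming the OCR step yields one lower-cased string per page. The `texts` list here is made-up stand-in data rather than real OCR output, and `flatten_hits` is a hypothetical helper name, not part of any package.

```r
# Hypothetical stand-in for the OCR result: one character vector per file,
# one element per page, already lower-cased.
all_files <- c("test.pdf", "test1.pdf", "test2.pdf")
texts <- list(
  c("i go gym and school regularly", "nothing here", "here is the next school"),
  c("in swimming pool", "next to gold gym"),
  c("no keywords on this page")
)
strings <- c("school", "gym", "swimming pool")

# Build one data-frame row per (file, page, word) hit; files with no hit
# get a single placeholder row.
flatten_hits <- function(files, texts, strings) {
  base <- tools::file_path_sans_ext(basename(files))
  rows <- list()
  for (d in seq_along(texts)) {
    file_rows <- list()
    for (p in seq_along(texts[[d]])) {
      for (w in strings) {
        # fixed = TRUE treats the search word as a literal string, not a regex
        if (grepl(w, texts[[d]][p], fixed = TRUE)) {
          file_rows[[length(file_rows) + 1]] <- data.frame(
            fileName = files[d],
            Status   = "Present",
            Page     = paste0(base[d], "_", p),
            Words    = w,
            TEXT     = texts[[d]][p],
            stringsAsFactors = FALSE
          )
        }
      }
    }
    if (length(file_rows) == 0) {
      file_rows[[1]] <- data.frame(
        fileName = files[d], Status = "Not Present",
        Page = "-", Words = "-", TEXT = "-",
        stringsAsFactors = FALSE
      )
    }
    rows <- c(rows, file_rows)
  }
  do.call(rbind, rows)
}

flat <- flatten_hits(all_files, texts, strings)
print(flat)
```

With real data, `texts` would come from the OCR step instead, and the page index `p` only matches the PDF page number if each page is OCR'd into its own element.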

Here is the current code:

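library(pdftools)   # provides pdf_convert()
library(tesseract)  # provides ocr()

all_files <- Sys.glob("*.pdf")
strings   <- c("school", "gym", "swimming pool")

# Read text from PDFs: render each page to TIFF, OCR it, and lower-case the result
texts <- lapply(all_files, function(x){
  img_file <- pdf_convert(x, format="tiff", dpi=400)
  return( tolower(ocr(img_file)) )
})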

# Check for presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    intermed   <- grep(strings[w], texts[[d]])
    words[[d]] <- c(words[[d]], 
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))

Page <- Map(paste0, fileName, "_", pages, collapse=", ")
Page[!grepl(",", Page)] <- "-"
Page <- t(data.frame(Page))

Words    <- sapply(words, paste0, collapse=", ")
#Words <- unlist(words, recursive = T)
Status   <- ifelse(sapply(Words, nchar) > 0, "Present", "Not present")

data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)

I tried changing it to use Words <- unlist(words, recursive = T), but I get this error:

Error in data.frame(row.names = fileName, Status = Status, Page = Page,  : 
  row names supplied are of the wrong length
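
A likely cause, sketched here with made-up data rather than the real OCR results: `unlist(words)` returns one element per matched word, while `fileName` has one element per file, so the two no longer have the same length. The `words` list below is a hypothetical example of what the loop might produce.

```r
# Hypothetical per-file matches, shaped like the `words` list built by the loop
fileName <- c("test", "test1", "test2")
words <- list(c("school", "gym"), c("gym", "swimming pool"), character(0))

flat_words <- unlist(words)   # files without any match contribute nothing
length(fileName)              # number of files
length(flat_words)            # number of matched words, not number of files

# One possible way to keep the columns aligned:
# repeat each file name once per matched word.
n_hits  <- lengths(words)
aligned <- rep(fileName, times = n_hits)
data.frame(fileName = aligned, Words = flat_words)
```

Note that repeated file names cannot serve as `row.names`, since row names must be unique; a regular column is used instead.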

Any suggestions on what should be changed?

Thanks

P.S.: Access the sample files

Transposing list elements and matching values in R

0 answers: