Question

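I would like to split the information in a PDF document based on the presence of colons. Here is a sample.

Download the updated PDF containing four pages

I am trying the following approaches. After reading the PDF, I try to split its text at the colons.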

library(textreadr)
dat <- '~Here is the thing1.pdf' %>%
    textreadr::read_pdf()
dat
Source: local data frame [26 x 3]

   page_id element_id                                     text
1        1          1                       Here is the thing.
2        1          2                                Case ID 1
3        1          3 Exploring Angels: It is a long establish
4        1          4 page when looking at its layout. The poi
5        1          5 distribution of letters, as opposed to u
6        1          6 English. Many desktop publishing package
7        1          7 model text, and a search for 'lorem ipsu
8        1          8 versions have evolved over the years, so
9        1          9                           and the like).
10       1         10 New agency: Lorem Ipsum is simply dummy 
..     ...        ...                                      ...

OR

library(pdftools)
library(stringr)  # provides str_split() and re-exports %>%
dat <- pdf_text("~Here is the thing1.pdf")
dat1 <- strsplit(dat[[1]], "\n")[[1]]
head(dat1)
[1] "Here is the thing.\r"                                                                                           
[2] "Case ID 1\r"                                                                                                    
[3] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a\r"
[4] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal\r"         
[5] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable\r"      
[6] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default\r"

dat2 <- dat1 %>%
  str_split(pattern = "\r") 
head(dat2)

[[1]]
[1] "Here is the thing." ""                  

[[2]]
[1] "Case ID 1" ""         

[[3]]
[1] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a"
[2] ""                                                                                                             

[[4]]
[1] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal"
[2] ""                                                                                                    

[[5]]
[1] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable"
[2] ""                                                                                                       

[[6]]
[1] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default"
[2] ""
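A side note on the output above: splitting on "\r" turns every line into a two-element list entry with a trailing empty string. A minimal base R sketch of an alternative is to strip the carriage return in place and then cut each line at its first ": ". The `lines` vector below is a made-up stand-in for the lines extracted from the PDF, not the real file contents.

```r
# Hypothetical sample standing in for strsplit(pdf_text(...)[[1]], "\n")[[1]]
lines <- c("Here is the thing.\r",
           "Case ID 1\r",
           "Exploring Angels: It is a long established fact\r",
           "page when looking at its layout.\r",
           "New agency: Lorem Ipsum is simply dummy text\r")

# Remove the trailing carriage return; the result is still a character vector
clean <- sub("\r$", "", lines)

# Position of the first ": " in each line (-1 when there is none)
pos <- regexpr(": ", clean, fixed = TRUE)

# Text before the first ": " is the key; lines without one get NA
key <- ifelse(pos > 0, substr(clean, 1, pos - 1), NA_character_)

# Text after the first ": "; lines without one are kept whole
text <- ifelse(pos > 0, substr(clean, pos + 2, nchar(clean)), clean)

parsed <- data.frame(key, text, stringsAsFactors = FALSE)
print(parsed)
```

Lines with an `NA` key are either headers or continuations of the previous entry, so they still have to be attached to the key above them.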

I would like to organize my data into a table like this:

  Case.ID                             Exploring.Angels                        New.agency New.Factor New.Factor2 Creative.One
1       1 It is a long established fact that a reader  Lorem Ipsum is simply dummy text         ABC         BNM         <NA>
2       2               Various versions have evolved     It has survived not only five         ABC        <NA>          DFZ

Answer 1

Here is how I would do it using the tidyverse:

library(tidyverse)

# read in the file, separate by line, convert to tibble
pdftools::pdf_text("../_xlam/Here is the thing1.pdf") %>% str_split("\\r?\\n") %>% 
  unlist() %>% enframe(name = NULL) %>% 
# separate cases and mark lines containing colon
  mutate(case=cumsum(str_detect(value, "Case ID")),
         tag_line=str_detect(value, ": ")) %>%
# drop lines with Case ID, separate tag from text, move text into one column, fill the tags
  filter(!str_detect(value,"Case ID")) %>% 
  separate(value, into = c("key", "text"), sep=": ", fill="right", extra="merge") %>% 
  mutate(text=ifelse(is.na(text), key, text),
         key=ifelse(tag_line, key, NA)) %>% fill(key) %>% 
# summarize text by concatenation
  group_by(case, key) %>% summarise(text=paste(text, collapse = " ")) %>% 
# filter away the `Here is the thing` line 
  drop_na(key) %>%
# move values to columns
  spread(key=key, value=text)
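
To try the same steps without the PDF, here is a sketch of the pipeline run on a small inline vector. The `lines` contents are invented for illustration, and `pivot_wider()` is swapped in for `spread()` as the newer tidyr equivalent (this assumes tidyr 1.0 or later and dplyr 1.0 or later for `.groups`).

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical lines standing in for the text extracted from the PDF
lines <- c("Here is the thing.",
           "Case ID 1",
           "Exploring Angels: It is a long established fact",
           "that a reader will be distracted",
           "New agency: Lorem Ipsum is simply dummy text",
           "Case ID 2",
           "Exploring Angels: Various versions have evolved",
           "Creative One: DFZ")

result <- tibble(value = lines) %>%
  # number the cases and mark lines that carry a "key: text" tag
  mutate(case = cumsum(str_detect(value, "Case ID")),
         tag_line = str_detect(value, ": ")) %>%
  # drop the Case ID lines, then split each line at the first ": "
  filter(!str_detect(value, "Case ID")) %>%
  separate(value, into = c("key", "text"), sep = ": ",
           fill = "right", extra = "merge") %>%
  # continuation lines keep their text and inherit the key above them
  mutate(text = ifelse(is.na(text), key, text),
         key = ifelse(tag_line, key, NA)) %>%
  fill(key) %>%
  # join the pieces of each entry into one string
  group_by(case, key) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  # the header line has no key, so it is removed here
  drop_na(key) %>%
  # one column per key, one row per case
  pivot_wider(names_from = key, values_from = text)

print(result)
```

Keys that do not occur in a given case come out as `NA` in that row, which matches the `<NA>` cells in the desired table.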
