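Concatenating list elements between regex pattern matches

Time: 2018-07-09 17:27:16

Tags: r dplyr tidyverse

I have a large character list containing hundreds of elements extracted from pdf documents. I am processing them so that they are stored in a data frame, with the names in one column and the text elements in another.

The names are numbered and usually start a new page. I match a regex pattern for the number and name and store the result in rows of the data frame. I want to store the text between each pair of names in the same row as the name that precedes those pages.

Here is an example: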

library(tidyverse)

text <- c("1. name Baramé stuff more stuff" ,"more stuff", "more stuff", 
      "2. name  D'orsons stuff more stuff", "more stuff more stuff", 
      "3. name Bar-son stuff more stuff more stuff more stuff", 
      "4. name lastname stuff", "more stuff more stuff", "more stuff")

This gives a character vector with 9 elements ([1:9]).

I process it into a data frame, matching the numbered names:

doc <- data_frame("name" = str_extract(text, 
                                   "^\\d+\\.\\s+([\\w\\. -'-]+)\\s+\\(+[A-Z]+[0-9]+\\)"), 
              "page" = str_detect(text, "^\\d+\\.\\s+([\\w\\. -'-]+)\\s+\\(+[A-Z]+[0-9]+\\)"),
              "sep" = unlist(strsplit(text, "^\\d+\\.\\s+([\\w\\. -'-]+)\\s+\\(+[A-Z]+[0-9]+\\)")))

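I cannot figure out the last step, in which the rows holding the text elements (pages) between the numbered names are all concatenated into a single row. Basically, all the text after "1. name lastname" and before "2. name lastname" should go into the variable "text", and so on.

In the end, the data frame should have as many rows as there are names, with the text in the same row as its name (the text between two names belongs to the earlier name).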

name                text
1. name Baramé    stuff more stuff more stuff more stuff
2. name D'orsons    stuff more stuff more stuff more stuff
3. name Bar-son    stuff more stuff more stuff more stuff 
4. name lastname    stuff more stuff more stuff

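I hope I have described this clearly enough.

Edit: I have revised the example so that it better represents the variation in last names that my actual regex handles.

3 Answers:

Answer 0: (score: 3)

Based on the pattern shown, we can create a grouping variable from each occurrence of a number at the start of a string, use that variable to paste the contents of each group together, and then use extract to split the resulting column into two parts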

library(tidyverse)
tibble(name = text, grp = cumsum(grepl("^\\d+", text)) )%>%
     group_by(grp) %>% 
     summarise(name = paste(name, collapse=" ")) %>% 
     select(-grp) %>%
     extract(name, into = c("name", "text"),
                     "^(\\S+\\s+\\S+\\s+\\S+)\\s+(.*)")
# A tibble: 4 x 2
#  name             text                                  
#  <chr>            <chr>                                 
#1 1. name lastname stuff more stuff more stuff more stuff
#2 2. name lastname stuff more stuff more stuff more stuff
#3 3. name lastname stuff more stuff more stuff more stuff
#4 4. name lastname stuff more stuff more stuff more stuff
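
To make the grouping step easier to follow, here is a minimal base R sketch of the intermediate values, using the updated sample vector from the question. The variable names starts and grp are illustrative only and do not appear in the answer.

```r
# Sample input from the question (updated version)
text <- c("1. name Baramé stuff more stuff", "more stuff", "more stuff",
          "2. name  D'orsons stuff more stuff", "more stuff more stuff",
          "3. name Bar-son stuff more stuff more stuff more stuff",
          "4. name lastname stuff", "more stuff more stuff", "more stuff")

# TRUE where an element begins with a number, i.e. starts a new record
starts <- grepl("^\\d+", text)

# Running count of record starts: each element receives the index of
# the most recent numbered line before (or at) its position
grp <- cumsum(starts)
grp
# [1] 1 1 1 2 2 3 4 4 4

# Elements sharing a group id belong to the same record
pieces <- split(text, grp)
```

Each entry of pieces can then be collapsed with paste(..., collapse = " "), which is what the summarise step above does per group.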

Or using base R

txt <- aggregate(name ~ grp, data.frame(name = text, 
   grp = cumsum(grepl("^\\d+", text))), FUN = paste, collapse=" ")[[2]]
txt1 <-  sub("(^(?:(\\S+\\s+){2})\\S+)", "\\1,", txt)
read.csv(text = txt1, header = FALSE, col.names = c("name", "text"))
#            name                                    text
#1 1. name lastname  stuff more stuff more stuff more stuff
#2 2. name lastname  stuff more stuff more stuff more stuff
#3 3. name lastname  stuff more stuff more stuff more stuff
#4 4. name lastname  stuff more stuff more stuff more stuff

Update

With the updated example, change the sub call to

txt1 <- sub("(^(?:([^ ]+\\s+){2})[^ ]+)", "\\1,", txt)
read.csv(text = txt1, header = FALSE, col.names = c("name", "text"))
#               name                                    text
#1    1. name Baramé  stuff more stuff more stuff more stuff
#2 2. name  D'orsons  stuff more stuff more stuff more stuff
#3   3. name Bar-son  stuff more stuff more stuff more stuff
#4  4. name lastname  stuff more stuff more stuff more stuff

Or with the tidyverse approach

tibble(name = text, grp = cumsum(grepl("^\\d+", text)) )%>%
      group_by(grp) %>% 
      summarise(name = paste(name, collapse=" ")) %>% 
      select(-grp) %>%
      extract(name, into = c("name", "text"),
                  "^([^ ]+\\s+[^ ]+\\s+[^ ]+)\\s+(.*)")
# A tibble: 4 x 2
#  name              text                                  
#  <chr>             <chr>                                 
#1 1. name Baramé    stuff more stuff more stuff more stuff
#2 2. name  D'orsons stuff more stuff more stuff more stuff
#3 3. name Bar-son   stuff more stuff more stuff more stuff
#4 4. name lastname  stuff more stuff more stuff more stuff

Answer 1: (score: 2)

To separate the records, you can take the cumulative sum of whether each element contains a digit, and subset by that.

nums <- cumsum(grepl('[0-9]', text))
out <- sapply(seq(max(nums)), function(x) paste(text[nums == x], collapse = ' '))
out
# [1] "1. name lastname stuff more stuff more stuff more stuff"
# [2] "2. name lastname stuff more stuff more stuff more stuff"
# [3] "3. name lastname stuff more stuff more stuff more stuff"
# [4] "4. name lastname stuff more stuff more stuff more stuff"
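
One caveat worth illustrating: the pattern '[0-9]' matches a digit anywhere in an element, so a continuation line that happens to contain a number would start a new group. The sketch below uses a hypothetical three-element input (not from the question) and compares it with a pattern anchored to the start of the line; the names x, loose and strict are assumptions made for this illustration.

```r
# Hypothetical input: the second element is a continuation line that
# happens to contain a digit
x <- c("1. name lastname stuff", "see page 12", "2. name lastname stuff")

# Digit anywhere: the continuation line is counted as a new record
loose <- cumsum(grepl("[0-9]", x))
loose
# [1] 1 2 3

# Number, dot and whitespace at the start: only numbered lines start a record
strict <- cumsum(grepl("^\\d+\\.\\s", x))
strict
# [1] 1 1 2
```

For the sample data in the question both patterns give the same groups, so the stricter version only matters if the real pages contain digits in the body text.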

To separate the name from the rest of the record, you can use word from stringr to extract the first three words.

library(stringr)
data.frame(name = word(out, 1, 3), stuff = word(out, 4, -1))
#               name                                  stuff
# 1 1. name lastname stuff more stuff more stuff more stuff
# 2 2. name lastname stuff more stuff more stuff more stuff
# 3 3. name lastname stuff more stuff more stuff more stuff
# 4 4. name lastname stuff more stuff more stuff more stuff

Or, if you want to drop the number

library(stringr)
data.frame(name = word(out, 2, 3),
           stuff = word(out, 4, -1))
#            name                                  stuff
# 1 name lastname stuff more stuff more stuff more stuff
# 2 name lastname stuff more stuff more stuff more stuff
# 3 name lastname stuff more stuff more stuff more stuff
# 4 name lastname stuff more stuff more stuff more stuff
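
The updated example in the question contains a double space in "2. name  D'orsons". Because word splits on a single space by default, repeated spaces can shift the word positions. A minimal sketch, assuming it is acceptable to collapse runs of whitespace, is to normalise the string with str_squish first. The names rec, clean, name and stuff are illustrative.

```r
library(stringr)

# One joined record from the updated example, with a double space after "name"
rec <- "2. name  D'orsons stuff more stuff more stuff more stuff"

# Collapse runs of whitespace to single spaces before counting words
clean <- str_squish(rec)

name  <- word(clean, 1, 3)   # "2. name D'orsons"
stuff <- word(clean, 4, -1)  # "stuff more stuff more stuff more stuff"
```

Note that this changes the name itself (the double space becomes a single one), which may or may not be desirable for the real data.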

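Answer 2: (score: 2)

library(tidyverse)
data.frame(text, stringsAsFactors = FALSE) %>%
   # group id increases at every line that starts with a number
   group_by(s = cumsum(grepl("^\\d+", text))) %>%
   summarise(new = paste(text, collapse = " ")) %>%
   # insert ":" after the first three words, then parse on that separator
   mutate(new = sub("((?:\\S+\\s+){3})", "\\1:", new)) %>%
   do(read.table(text = .$new, sep = ":", col.names = c("name", "text")))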
              name                                   text
1 1. name lastname  stuff more stuff more stuff more stuff
2 2. name lastname  stuff more stuff more stuff more stuff
3 3. name lastname  stuff more stuff more stuff more stuff
4 4. name lastname  stuff more stuff more stuff more stuff

In base R:

s=tapply(text,cumsum(grepl("^\\d+",text)),paste,collapse=" ")
read.table(text=sub("(^(?:\\S+\\s+){3})","\\1:",s),sep=":",col.names = c("name","text"))
               name                                   text
1 1. name lastname  stuff more stuff more stuff more stuff
2 2. name lastname  stuff more stuff more stuff more stuff
3 3. name lastname  stuff more stuff more stuff more stuff
4 4. name lastname  stuff more stuff more stuff more stuff
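
The read.table calls above rely on ":" not occurring in the data, and read.table treats both double quotes and apostrophes as quote characters by default, so a name such as D'orsons from the updated example may be misparsed. As a hedged alternative, the sketch below builds the two columns directly with sub and skips the parsing step. It assumes every name consists of exactly three whitespace-separated tokens, as in the sample; the names grp, joined, pattern and res are illustrative.

```r
# Sample input from the question (updated version)
text <- c("1. name Baramé stuff more stuff", "more stuff", "more stuff",
          "2. name  D'orsons stuff more stuff", "more stuff more stuff",
          "3. name Bar-son stuff more stuff more stuff more stuff",
          "4. name lastname stuff", "more stuff more stuff", "more stuff")

# Group id increases at every element that starts with a number
grp <- cumsum(grepl("^\\d+", text))

# Join the elements of each group into one string per record
joined <- unname(vapply(split(text, grp), paste, character(1), collapse = " "))

# First three tokens are the name, everything after them is the text
pattern <- "^(\\S+\\s+\\S+\\s+\\S+)\\s+(.*)$"
res <- data.frame(name = sub(pattern, "\\1", joined),
                  text = sub(pattern, "\\2", joined),
                  stringsAsFactors = FALSE)
res
```

Because no separator character is inserted and nothing is re-parsed, apostrophes and colons in the records pass through unchanged.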