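I have a large character vector containing hundreds of elements taken from a PDF document. I am processing them so that they end up in a data frame with the names in one column and the text elements in another.
I noticed that the names are numbered and usually start a new page. I am matching a regex pattern for the number and name and storing the match in a row of the data frame. I want to store the text that sits between two names in the same row as the name that precedes it.
Here is an example: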
library(tidyverse)
text <- c("1. name Baramé stuff more stuff" ,"more stuff", "more stuff",
"2. name D'orsons stuff more stuff", "more stuff more stuff",
"3. name Bar-son stuff more stuff more stuff more stuff",
"4. name lastname stuff", "more stuff more stuff", "more stuff")
This gives a character vector with [1:9] elements.
I process it into a data frame by matching the numbered names:
# numbered name followed by a parenthesised code (capital letters, then digits)
pattern <- "^\\d+\\.\\s+([\\w\\. -'-]+)\\s+\\(+[A-Z]+[0-9]+\\)"
doc <- tibble(name = str_extract(text, pattern),
              page = str_detect(text, pattern),
              sep = unlist(strsplit(text, pattern)))
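I cannot figure out the last step, in which the rows holding the text elements (pages) between two numbered names are concatenated into a single row. Basically, all the text after "1. name lastname" and before "2. name lastname" should go into the variable "text", and so on.
So, in the end, the data frame has as many rows as there are names, and each row holds the text that belongs to that name (the text between two names goes to the preceding name).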
name               text
1. name Baramé     stuff more stuff more stuff more stuff
2. name D'orsons   stuff more stuff more stuff more stuff
3. name Bar-son    stuff more stuff more stuff more stuff
4. name lastname   stuff more stuff more stuff more stuff
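I hope I have described this clearly enough.
Edit: I have modified the example so that it better represents the variation in last names that my actual regex has to handle.
Answer 0 (score: 3)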
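Based on the pattern shown, we can create a grouping variable that increments wherever a string starts with a number, use it to paste the contents of each group together, and then use extract to split that column into two.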
library(tidyverse)
tibble(name = text, grp = cumsum(grepl("^\\d+", text)) )%>%
group_by(grp) %>%
summarise(name = paste(name, collapse=" ")) %>%
select(-grp) %>%
extract(name, into = c("name", "text"),
"^(\\S+\\s+\\S+\\s+\\S+)\\s+(.*)")
# A tibble: 4 x 2
# name text
# <chr> <chr>
#1 1. name lastname stuff more stuff more stuff more stuff
#2 2. name lastname stuff more stuff more stuff more stuff
#3 3. name lastname stuff more stuff more stuff more stuff
#4 4. name lastname stuff more stuff more stuff more stuff
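To make the grouping step easier to follow, here is a minimal sketch of what the cumsum(grepl(...)) expression produces on the sample vector from the question. The names starts and grp are only illustrative.

```r
text <- c("1. name Baramé stuff more stuff", "more stuff", "more stuff",
          "2. name D'orsons stuff more stuff", "more stuff more stuff",
          "3. name Bar-son stuff more stuff more stuff more stuff",
          "4. name lastname stuff", "more stuff more stuff", "more stuff")

# TRUE where an element opens a new record, i.e. starts with a digit
starts <- grepl("^\\d+", text)

# Running count of record starts: each element gets the index of its record
grp <- cumsum(starts)
grp
# [1] 1 1 1 2 2 3 4 4 4

# Elements that share an index belong to the same record
split(text, grp)
```

Every continuation line inherits the index of the numbered line before it, which is why pasting within each group yields one string per name.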
Or using base R:
txt <- aggregate(name ~ grp, data.frame(name = text,
grp = cumsum(grepl("^\\d+", text))), FUN = paste, collapse=" ")[[2]]
txt1 <- sub("(^(?:(\\S+\\s+){2})\\S+)", "\\1,", txt)
read.csv(text = txt1, header = FALSE, col.names = c("name", "text"))
# name text
#1 1. name lastname stuff more stuff more stuff more stuff
#2 2. name lastname stuff more stuff more stuff more stuff
#3 3. name lastname stuff more stuff more stuff more stuff
#4 4. name lastname stuff more stuff more stuff more stuff
With the updated example, change the sub call to
txt1 <- sub("(^(?:([^ ]+\\s+){2})[^ ]+)", "\\1,", txt)
read.csv(text = txt1, header = FALSE, col.names = c("name", "text"))
# name text
#1 1. name Baramé stuff more stuff more stuff more stuff
#2 2. name D'orsons stuff more stuff more stuff more stuff
#3 3. name Bar-son stuff more stuff more stuff more stuff
#4 4. name lastname stuff more stuff more stuff more stuff
Or, with the tidyverse approach:
tibble(name = text, grp = cumsum(grepl("^\\d+", text)) )%>%
group_by(grp) %>%
summarise(name = paste(name, collapse=" ")) %>%
select(-grp) %>%
extract(name, into = c("name", "text"),
"^([^ ]+\\s+[^ ]+\\s+[^ ]+)\\s+(.*)")
# A tibble: 4 x 2
# name text
# <chr> <chr>
#1 1. name Baramé stuff more stuff more stuff more stuff
#2 2. name D'orsons stuff more stuff more stuff more stuff
#3 3. name Bar-son stuff more stuff more stuff more stuff
#4 4. name lastname stuff more stuff more stuff more stuff
Answer 1 (score: 2)
To separate the records, you can take the cumulative sum of whether each element contains a digit and subset by it.
nums <- cumsum(grepl('[0-9]', text))
out <- sapply(seq(max(nums)), function(x) paste(text[nums == x], collapse = ' '))
out
# [1] "1. name lastname stuff more stuff more stuff more stuff"
# [2] "2. name lastname stuff more stuff more stuff more stuff"
# [3] "3. name lastname stuff more stuff more stuff more stuff"
# [4] "4. name lastname stuff more stuff more stuff more stuff"
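One caveat worth checking on real data: '[0-9]' matches a digit anywhere in an element, so a continuation line that happens to contain a number would open a new record. The sketch below uses a made-up vector (pages, not part of the question) to compare it with a pattern anchored at the start of the string.

```r
# Hypothetical input: the second element is a continuation line containing a digit
pages <- c("1. name lastname stuff", "more stuff from 2019", "2. name lastname stuff")

# Digit anywhere: the continuation line is counted as a new record
cumsum(grepl("[0-9]", pages))
# [1] 1 2 3

# Number, dot and whitespace at the start only: the continuation line stays in record 1
anchored <- cumsum(grepl("^\\d+\\.\\s", pages))
anchored
# [1] 1 1 2

sapply(seq(max(anchored)), function(x) paste(pages[anchored == x], collapse = " "))
```

If the continuation lines in the PDF never contain digits, both patterns give the same grouping.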
To separate the name from the rest of the record, you can use word from stringr to extract the first three words.
library(stringr)
data.frame(name = word(out, 1, 3), stuff = word(out, 4, -1))
# name stuff
# 1 1. name lastname stuff more stuff more stuff more stuff
# 2 2. name lastname stuff more stuff more stuff more stuff
# 3 3. name lastname stuff more stuff more stuff more stuff
# 4 4. name lastname stuff more stuff more stuff more stuff
Or, if you want to drop the number:
library(stringr)
data.frame(name = word(out, 2, 3),
stuff = word(out, 4, -1))
# name stuff
# 1 name lastname stuff more stuff more stuff more stuff
# 2 name lastname stuff more stuff more stuff more stuff
# 3 name lastname stuff more stuff more stuff more stuff
# 4 name lastname stuff more stuff more stuff more stuff
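If the number should be kept, but in its own column, one possible variation is str_match from stringr with capture groups. This is only a sketch on a shortened, made-up out vector; it assumes, like the word() calls above, that every name consists of exactly two whitespace-separated words.

```r
library(stringr)

# Shortened stand-in for the pasted records
out <- c("1. name Baramé stuff more stuff", "2. name D'orsons stuff")

# Groups: (number) "." (two-word name) (everything else)
m <- str_match(out, "^(\\d+)\\.\\s+(\\S+\\s+\\S+)\\s+(.*)$")

# Column 1 of the matrix is the full match; columns 2 to 4 are the groups
data.frame(number = as.integer(m[, 2]), name = m[, 3], stuff = m[, 4])
```

Records that do not fit the pattern come back as NA, which makes malformed pages easy to spot.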
Answer 2 (score: 2)
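library(dplyr)

data.frame(text, stringsAsFactors = FALSE) %>%
  group_by(s = cumsum(grepl("^\\d+", text))) %>%
  summarise(new = paste(text, collapse = " ")) %>%
  # mark the end of the first three words with ":" so read.table can split there
  mutate(new = sub("((?:\\S+\\s+){3})", "\\1:", new)) %>%
  # quote = "" keeps apostrophes such as the one in D'orsons from being read as quotes
  do(read.table(text = .$new, sep = ":", quote = "", col.names = c("name", "text")))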
name text
1 1. name lastname stuff more stuff more stuff more stuff
2 2. name lastname stuff more stuff more stuff more stuff
3 3. name lastname stuff more stuff more stuff more stuff
4 4. name lastname stuff more stuff more stuff more stuff
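In base R:
s <- tapply(text, cumsum(grepl("^\\d+", text)), paste, collapse = " ")
# quote = "" keeps apostrophes in names (e.g. D'orsons) from being parsed as quotes
read.table(text = sub("(^(?:\\S+\\s+){3})", "\\1:", s), sep = ":", quote = "",
           col.names = c("name", "text"))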
name text
1 1. name lastname stuff more stuff more stuff more stuff
2 2. name lastname stuff more stuff more stuff more stuff
3 3. name lastname stuff more stuff more stuff more stuff
4 4. name lastname stuff more stuff more stuff more stuff
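The ":" separator only works as long as the text itself never contains a colon. As a hedged alternative, the sketch below splits each record with two sub() calls on the same pattern, so no delimiter or read.table round trip is needed. The names grp, pat and res are illustrative, and the pattern still assumes that the name is the first three whitespace-separated tokens.

```r
text <- c("1. name Baramé stuff more stuff", "more stuff", "more stuff",
          "2. name D'orsons stuff more stuff", "more stuff more stuff",
          "3. name Bar-son stuff more stuff more stuff more stuff",
          "4. name lastname stuff", "more stuff more stuff", "more stuff")

# One pasted string per record, grouped by the running count of numbered lines
grp <- cumsum(grepl("^\\d+", text))
s <- unname(vapply(split(text, grp), paste, character(1), collapse = " "))

# Group 1: first three tokens (number and name); group 2: the remaining text
pat <- "^((?:\\S+\\s+){2}\\S+)\\s+(.*)$"

res <- data.frame(name = sub(pat, "\\1", s, perl = TRUE),
                  text = sub(pat, "\\2", s, perl = TRUE),
                  stringsAsFactors = FALSE)
res
```

Because nothing is re-parsed, colons, commas and apostrophes in the text pass through unchanged, and neither column picks up stray leading or trailing spaces.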