How do I use str_split with a regex in R?

Time: 2018-03-22 22:50:29

Tags: r regex stringr strsplit

I have this string:

235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things

I want to split the string at each 6-digit number. That is, I want this:

235072,testing,some252f4,14084-things
224072,and,other2524,14084-thingies
223552,testing,some/2wr24,14084-things

How can I do this with a regular expression? The following does not work (using the stringr package):

> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""

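What am I missing?

4 answers:

Answer 0 (score: 5)

Here is one way in base R, using a positive lookahead and a lookbehind, with thanks to @thelatemail for the correction: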

strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"  
# [2] "224072,and,other2524,14084-thingies"    
# [3] "223552,testing,some/2wr24,14084-things"
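
The call above assumes that x already holds the string from the question. A minimal self-contained sketch of the same idea follows; the wrapper name split_before_ids is hypothetical and exists only for illustration. Both lookarounds are zero-width, so no characters are removed from the pieces, and the lookbehind (?<=.) keeps the pattern from matching at the very start of the string.

```r
# Hypothetical helper (not part of base R): split a string at every
# position that is immediately followed by six digits.
split_before_ids <- function(x) {
  # (?<=.)        at least one character must precede the split point
  # (?=[0-9]{6})  six digits must follow it; nothing is consumed
  strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
}

x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
parts <- split_before_ids(x)
print(parts)
```

perl = TRUE is required here because lookarounds are only available through the PCRE engine in base R.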

Answer 1 (score: 2)

An alternative using str_extract_all. Note that I have used .*? for 'non-greedy' matching; otherwise .* would expand to grab everything:

> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things"  "224072,and,other2524,14084-thingies"    "223552,testing,some/2wr24,14084-things"
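
To see the difference between greedy and non-greedy matching without loading stringr, the sketch below does the same extraction with base R's gregexpr() and regmatches(). This is a swapped-in base R equivalent, not the code from the answer. The greedy case also shows why the attempt in the question returns two empty strings: [0-9]{6}.* matches the whole string, so the entire input is treated as the separator.

```r
blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

# Greedy: .* runs to the end of the string, so there is a single match
greedy <- regmatches(blahblah,
                     gregexpr("[0-9]{6}.*", blahblah, perl = TRUE))[[1]]

# Non-greedy: .*? stops as soon as the lookahead sees six digits
# or the end of the string
lazy <- regmatches(blahblah,
                   gregexpr("[0-9]{6}.*?(?=[0-9]{6}|$)", blahblah, perl = TRUE))[[1]]

print(length(greedy))  # one match covering the whole string
print(lazy)            # one element per 6-digit record
```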

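Answer 2 (score: 0)

An easy-to-understand approach is to insert markers and then split at those markers. The advantage is that it only looks for 6-digit sequences and does not depend on any other feature of the surrounding text, which may change as you add new, unseen data.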

library(stringr)
library(magrittr)

str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

out <- 
    str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>% 
    str_split("#SPLIT_HERE#") %>% 
    unlist

[1] ""                                       "235072,testing,some252f4,14084-things" 
[3] "224072,and,other2524,14084-thingies"    "223552,testing,some/2wr24,14084-things"

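If a match occurs at the beginning or end of the string, str_split() inserts an empty string into the result vector to indicate this (as above). If you do not need that information, you can easily remove it with out[nchar(out) != 0].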

[1] "235072,testing,some252f4,14084-things"  "224072,and,other2524,14084-thingies"   
[3] "223552,testing,some/2wr24,14084-things"
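
For readers who prefer to avoid the extra packages, the marker approach can be sketched in base R as follows. This is an illustrative variant, not the code from the answer; it assumes the marker text #SPLIT_HERE# never occurs in the data, and it uses nzchar() to drop the leading empty string.

```r
x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

marker <- "#SPLIT_HERE#"  # assumption: this text never appears in the data

# Put the marker in front of every 6-digit run, keeping the digits (\\1)
marked <- gsub("([0-9]{6})", paste0(marker, "\\1"), x)

# Split on the literal marker; fixed = TRUE turns off regex interpretation
pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]

# The string starts with a marker, so the first piece is empty; drop it
pieces <- pieces[nzchar(pieces)]
print(pieces)
```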

Answer 3 (score: 0)

With a less complex regex, you can do the following:

library(stringr)

s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
# Start position of every 6-digit run
starts <- str_locate_all(s, "[0-9]{6}")[[1]][, "start"]
# Each piece ends just before the next start; the last one ends with the string
ends <- c(starts[-1] - 1, nchar(s))
str_sub(s, start = starts, end = ends)
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
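
The locate-then-cut idea also works without stringr. The sketch below is a base R equivalent under the same assumption as the answer, namely that every 6-digit run marks the start of a new record; the names starts, ends and records are introduced here for illustration.

```r
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

# Positions where a 6-digit run begins (as.integer drops the match attributes)
starts <- as.integer(gregexpr("[0-9]{6}", s)[[1]])

# Each record ends one character before the next start;
# the last record ends at the end of the string
ends <- c(starts[-1] - 1L, nchar(s))

# substring() is vectorised over its first and last arguments
records <- substring(s, starts, ends)
print(records)
```

Because the cut points come from match positions rather than from a split pattern, no lookarounds are needed and nothing has to be filtered out afterwards.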