Regular expression matching in dplyr

Time: 2015-07-07 13:43:54

Tags: regex r dplyr stringr

While answering this question, I wrote the following code:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))

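library(stringr)

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])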

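Now my question: is there a simple way to combine the last two lines into a single dplyr call, presumably using mutate()? Alternatively, I would also be interested in a solution using do(). For the mutate() approach, since we are extracting two groups, I would accept a solution that calls str_match() twice with different regular expressions, one for each desired group.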

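Edit: To clarify, the main challenge I see here is that str_match() returns a matrix, and I would like to know how to handle that inside mutate() or do(). I am not interested in solving the original problem with other methods of extracting the information. There are already plenty of such solutions here.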
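To make the matrix issue concrete, here is a minimal sketch of what str_match() returns for this pattern. The shortened vector `calls` and the variable name `m` are made up for illustration; only str_match() and the regular expression come from the code above.

```r
library(stringr)

# Two of the call numbers from the question (shortened sample, illustration only)
calls <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
           "TL515 .M63 Circulating Collection, 3rd Floor")

m <- str_match(calls, "([A-Z]+)(\\d+)\\s*\\.")

dim(m)   # one row per input string; one column for the full match plus one per group
m[, 2]   # first capture group (the letters)
m[, 3]   # second capture group (the digits)
```

Because the result is a matrix rather than a vector, a single column has to be selected with `[, i]` before it can be assigned to a new column.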

2 answers:

Answer 0 (score: 6)

You can do this with extract() from the tidyr package:

extract(df, Call_Num, into = c("letter", "number"), regex = "([A-Z]+)(\\d+)\\s*\\.", remove = FALSE)

                                             Call_Num letter number
1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
4          D753 .F4 Circulating Collection, 3rd Floor      D    753
5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

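It is not dplyr, but as stated on the CRAN page linked above, tidyr is "designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines."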
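As a rough illustration of how extract() can sit inside a dplyr pipeline, the sketch below uses a two-row sample of the data. The `convert = TRUE` argument and the trailing filter() step are additions for illustration and are not part of the answer above; `res` is a made-up name.

```r
library(dplyr)
library(tidyr)

# Two-row sample of the question's data (illustration only)
df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # tidyr:: prefix avoids a clash with magrittr::extract if magrittr is attached
  tidyr::extract(Call_Num, into = c("letter", "number"),
                 regex = "([A-Z]+)(\\d+)\\s*\\.",
                 remove = FALSE,
                 convert = TRUE) %>%   # convert = TRUE turns the digit strings into integers
  filter(number > 1000)                # illustrative downstream dplyr step
```

With `convert = TRUE`, the new `number` column can be compared numerically in later steps without a separate as.integer() call.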

Answer 1 (score: 3)

You could try using do():

df %>% 
  do(data.frame(., str_match(.$Call_Num,  "([A-Z]+)(\\d+)\\s*\\.")[,-1],
                              stringsAsFactors=FALSE)) %>%
  rename_(.dots=setNames(names(.)[-1],c('letter', 'number')))
#                                             Call_Num letter number
#1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
#2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
#3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
#4          D753 .F4 Circulating Collection, 3rd Floor      D    753
#5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

Or, as @SamFirke commented, the columns can also be renamed with setNames():

  ---                                    %>%
 setNames(., c(names(.)[1], "letter", "number"))
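For the mutate() route the question describes, a minimal sketch is to call str_match() once per new column and pick the matching matrix column each time. This is not taken from either answer; the three-row sample and the names `pattern` and `df2` are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

# Three-row sample of the question's data (illustration only)
df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

pattern <- "([A-Z]+)(\\d+)\\s*\\."

df2 <- df %>%
  mutate(letter = str_match(Call_Num, pattern)[, 2],  # column 2: first capture group
         number = str_match(Call_Num, pattern)[, 3])  # column 3: second capture group
```

This runs the regular expression twice, which the question explicitly allows. It also needs no renaming step, which may be convenient given that rename_() was deprecated in later dplyr releases.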