Extract uppercase words and the last word in a string

Asked: 2015-08-10 18:46:14

Tags: regex r

My df looks like this:

df <- data.frame(
    x = c(
        "800 Block of MAIN ST",
        "100 Block of CHESTNUT AV", 
        "BAY ST / WELLINGTON ST", 
        "LARKIN ST / ELLIS ST",
        "MAPLE ST / WELLINGTON ST", 
        "MEANDERING RD / MAIN ST"),
    y = rnorm(6))

I want to extract the first street name and the last street type from each row.

Desired output:

                         x          y  x.1        x.2
1     800 Block of MAIN ST -0.6745405  MAIN       ST
2 100 Block of CHESTNUT AV -1.1316017  CHESTNUT   AV 
3   BAY ST / WELLINGTON ST  1.2887577  BAY        ST
4     LARKIN ST / ELLIS ST  1.4606264  LARKIN     ST
5 MAPLE ST / WELLINGTON ST  0.6538595  MAPLE      ST
6  MEANDERING RD / MAIN ST  0.8472322  MEANDERING ST

3 answers:

Answer 0 (score: 4)

library(stringr)
df[,c("street", "type")] <- list(str_extract(df$x, "[A-Z]{3,}"), str_extract(df$x, "[A-Z]+$"))
#                          x          y     street type
# 1     800 Block of MAIN ST  0.7787495       MAIN   ST
# 2 100 Block of CHESTNUT AV -0.7069777   CHESTNUT   AV
# 3   BAY ST / WELLINGTON ST -0.2365061        BAY   ST
# 4     LARKIN ST / ELLIS ST  0.1399500     LARKIN   ST
# 5 MAPLE ST / WELLINGTON ST -0.3423978      MAPLE   ST
# 6  MEANDERING RD / MAIN ST  0.6434969 MEANDERING   ST

Answer 1 (score: 3)

df <- within(df, st_name <- sub(".*?([A-Z]{3,}).*", "\\1", x, perl=TRUE))

df <- within(df, st_type <- sub(".+? ([A-Z]+)$", "\\1", x, perl=TRUE))
#                         x           y    st_name st_type
#1     800 Block of MAIN ST  1.92908789       MAIN      ST
#2 100 Block of CHESTNUT AV  0.02487045   CHESTNUT      AV
#3   BAY ST / WELLINGTON ST -2.33411242        BAY      ST
#4     LARKIN ST / ELLIS ST -1.17946144     LARKIN      ST
#5 MAPLE ST / WELLINGTON ST  0.12913797      MAPLE      ST
#6  MEANDERING RD / MAIN ST -0.94150930 MEANDERING      ST

Or, if you'd rather not use within:

df$st_name <- sub(".*?([A-Z]{3,}).*", "\\1", df$x, perl=TRUE)
df$st_type <- sub(".+? ([A-Z]+)$", "\\1", df$x, perl=TRUE)
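
One thing to keep in mind with the sub approach: when the pattern does not match, sub returns the input unchanged rather than NA. Below is a small base R sketch of that behaviour. It uses a made-up second address with no uppercase word, and a helper named extract_or_na that is hypothetical (not part of any package) and returns NA for non-matches via regexpr and regmatches. It assumes x is a character vector, not a factor.

```r
x <- c("800 Block of MAIN ST", "no uppercase here")

# sub() leaves a non-matching input untouched
st_name_sub <- sub(".*?([A-Z]{3,}).*", "\\1", x, perl = TRUE)
st_name_sub
# [1] "MAIN"              "no uppercase here"

# Hypothetical helper: first match of `pattern` in each element, NA if none
extract_or_na <- function(x, pattern) {
    m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
    out <- rep(NA_character_, length(x))
    out[m != -1] <- regmatches(x, m)        # regmatches() drops non-matches
    out
}

extract_or_na(x, "[A-Z]{3,}")   # first run of 3+ capitals: "MAIN" NA
extract_or_na(x, "[A-Z]+$")     # trailing run of capitals:  "ST"   NA
```

For the six rows in the question every pattern matches, so both approaches give the same result there.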

Answer 2 (score: 3)

Here is a similar solution that uses a single regex together with the new tstrsplit function from the development version of data.table:

library(data.table) # v1.9.5+
setDT(df)[, c("street", "type") := 
              tstrsplit(sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", x), ",")]
df
#                           x          y     street type
# 1:     800 Block of MAIN ST -1.4391801       MAIN   ST
# 2: 100 Block of CHESTNUT AV  1.4917789   CHESTNUT   AV
# 3:   BAY ST / WELLINGTON ST -0.0369405        BAY   ST
# 4:     LARKIN ST / ELLIS ST  0.7320230     LARKIN   ST
# 5: MAPLE ST / WELLINGTON ST  0.7189120      MAPLE   ST
# 6:  MEANDERING RD / MAIN ST -0.9836794 MEANDERING   ST

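Basically, the idea here is to capture both groups in a single sub call, join them with a comma (you can choose a different separator if you like), and then perform a transposed string split (tstrsplit) so that they become two separate columns, which are created by reference with the := operator.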
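
The capture, join, and split idea does not depend on data.table. Below is a minimal base R sketch of the same steps, assuming the x column from the question and that no address contains a comma. The names joined and parts are only illustrative, and perl = TRUE is added here for the lazy quantifier.

```r
# Example data from the question (without the random y column)
df <- data.frame(
    x = c("800 Block of MAIN ST", "100 Block of CHESTNUT AV",
          "BAY ST / WELLINGTON ST", "LARKIN ST / ELLIS ST",
          "MAPLE ST / WELLINGTON ST", "MEANDERING RD / MAIN ST"),
    stringsAsFactors = FALSE)

# One sub() call captures both groups and joins them with a comma
joined <- sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", df$x, perl = TRUE)

# Split on the comma and bind the pieces row-wise into a 2-column matrix
parts <- do.call(rbind, strsplit(joined, ",", fixed = TRUE))

df$street <- parts[, 1]
df$type   <- parts[, 2]
df
```

Because the middle `.*` is greedy, the second group only ever receives the last two uppercase letters. That is fine for two-letter types such as ST and AV, but a longer type such as AVE would be cut down to VE.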