My df looks like this:
df <- data.frame(
x = c(
"800 Block of MAIN ST",
"100 Block of CHESTNUT AV",
"BAY ST / WELLINGTON ST",
"LARKIN ST / ELLIS ST",
"MAPLE ST / WELLINGTON ST",
"MEANDERING RD / MAIN ST"),
y = rnorm(6))
I want to extract the first street name and the last street type.
Expected output:
                         x          y        x.1 x.2
1     800 Block of MAIN ST -0.6745405       MAIN  ST
2 100 Block of CHESTNUT AV -1.1316017   CHESTNUT  AV
3   BAY ST / WELLINGTON ST  1.2887577        BAY  ST
4     LARKIN ST / ELLIS ST  1.4606264     LARKIN  ST
5 MAPLE ST / WELLINGTON ST  0.6538595      MAPLE  ST
6  MEANDERING RD / MAIN ST  0.8472322 MEANDERING  ST
Answer 0 (score: 4)
library(stringr)
df[,c("street", "type")] <- list(str_extract(df$x, "[A-Z]{3,}"), str_extract(df$x, "[A-Z]+$"))
#                          x          y     street type
# 1     800 Block of MAIN ST  0.7787495       MAIN   ST
# 2 100 Block of CHESTNUT AV -0.7069777   CHESTNUT   AV
# 3   BAY ST / WELLINGTON ST -0.2365061        BAY   ST
# 4     LARKIN ST / ELLIS ST  0.1399500     LARKIN   ST
# 5 MAPLE ST / WELLINGTON ST -0.3423978      MAPLE   ST
# 6  MEANDERING RD / MAIN ST  0.6434969 MEANDERING   ST
Answer 1 (score: 3)
df <- within(df, st_name <- sub(".*?([A-Z]{3,}).*", "\\1", x, perl=TRUE))
df <- within(df, st_type <- sub(".+? ([A-Z]+)$", "\\1", x, perl=TRUE))
#                         x           y    st_name st_type
#1     800 Block of MAIN ST  1.92908789       MAIN      ST
#2 100 Block of CHESTNUT AV  0.02487045   CHESTNUT      AV
#3   BAY ST / WELLINGTON ST -2.33411242        BAY      ST
#4     LARKIN ST / ELLIS ST -1.17946144     LARKIN      ST
#5 MAPLE ST / WELLINGTON ST  0.12913797      MAPLE      ST
#6  MEANDERING RD / MAIN ST -0.94150930 MEANDERING      ST
Or, if you would rather not use within:
df$st_name <- sub(".*?([A-Z]{3,}).*", "\\1", df$x, perl=TRUE)
df$st_type <- sub(".+? ([A-Z]+)$", "\\1", df$x, perl=TRUE)
Answer 2 (score: 3)
Here is a similar solution that uses a single regular expression together with the new tstrsplit function from the development version of data.table:
library(data.table) # v1.9.5+
setDT(df)[, c("street", "type") :=
tstrsplit(sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", x), ",")]
df
#                            x          y     street type
# 1:     800 Block of MAIN ST -1.4391801       MAIN   ST
# 2: 100 Block of CHESTNUT AV  1.4917789   CHESTNUT   AV
# 3:   BAY ST / WELLINGTON ST -0.0369405        BAY   ST
# 4:     LARKIN ST / ELLIS ST  0.7320230     LARKIN   ST
# 5: MAPLE ST / WELLINGTON ST  0.7189120      MAPLE   ST
# 6:  MEANDERING RD / MAIN ST -0.9836794 MEANDERING   ST
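The basic idea is to capture both groups in a single sub call, join them with a comma (any other separator that cannot occur in the data would work just as well), and then run a transposed string split (tstrsplit) so that the two pieces become separate columns, created by reference with the := operator.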
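For readers without data.table, the same capture, join, and split idea can be sketched in base R. This is only an illustration under a few assumptions: the vector x and the names joined and parts are made up for the example, perl = TRUE and a trailing $ anchor are added here so the lazy quantifier and the end-of-string match behave predictably, and every input string is assumed to match the pattern (a non-matching string would come back from sub unchanged and break the two-column shape).

```r
# Toy input mirroring a few rows of the question's data (illustrative only).
x <- c("800 Block of MAIN ST",
       "BAY ST / WELLINGTON ST",
       "MEANDERING RD / MAIN ST")

# One sub() call captures both groups and joins them with a comma:
#   group 1 = first run of 3+ capital letters,
#   group 2 = run of 2+ capital letters at the very end of the string.
joined <- sub(".*?([A-Z]{3,}).*([A-Z]{2,})$", "\\1,\\2", x, perl = TRUE)

# Split each joined string on the comma, then bind the pieces row by row
# into a two-column character matrix.
parts <- do.call(rbind, strsplit(joined, ",", fixed = TRUE))
colnames(parts) <- c("street", "type")
parts
```

Here parts[, "street"] holds MAIN, BAY and MEANDERING, and parts[, "type"] holds ST for all three rows; the columns could then be attached to a data frame with cbind or by assigning them one at a time.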