I have a column of addresses. I want to parse it so that only the state name is left. Here is my column:
structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company",
"10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub",
"10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545",
"11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133",
"1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733",
"1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007",
"830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
)), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
I also have another vector to compare against and use as the replacement.
v<- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
I did try using match and grep, but they returned NAs.
Answer 0 (score: 2)
Here is a base R option using grepl:
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
states <- paste0("\\b", v, "\\b", collapse="|")
states
[1] "\\bTexas\\b|\\bPennsylvania\\b|\\bOregon\\b|\\bOregon\\b|\\bIdaho\\b"
df[grepl(states, df$BreweryAddress), ]
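I printed states so that it is clear which regex pattern we are using to search the brewery addresses. The pattern is an alternation of the state names, each wrapped in word boundary markers. This ensures we don't accidentally match a string that merely contains a state name as a substring of a longer word. Note that the final line keeps the rows whose address mentions one of the states; it does not extract the state name itself.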
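To go from filtering rows to pulling out the state name, the same word-boundary pattern can be passed to regexpr and regmatches. This is only a minimal sketch: the `addresses` vector is a three-row subset of the question's data, and the names `pattern`, `m`, and `state` are introduced here for illustration.

```r
# Subset of the addresses from the question (assumption: three rows are
# enough to show a match, a non-match, and another match).
addresses <- c(
  "407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545",
  "1501 E StSan Diego, California, 92101United States",
  "830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
)
v <- c("Texas", "Pennsylvania", "Oregon", "Oregon", "Idaho")

# One alternation of the unique state names, wrapped in word boundaries.
pattern <- paste0("\\b(", paste(unique(v), collapse = "|"), ")\\b")

# regexpr() returns -1 where nothing matched; regmatches() returns only
# the matched pieces, so fill them into a vector of NAs.
m <- regexpr(pattern, addresses, perl = TRUE)
state <- rep(NA_character_, length(addresses))
state[m > 0] <- regmatches(addresses, m)

state
```

Rows whose address contains none of the listed states (California here) stay NA.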
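Answer 1 (score: 1)
Here is a tidyverse solution. We can join the states into a single pattern using | as the separator, which indicates that any one of them may match, and then extract the match from the address column. This is crude (what if there is a brewery on Idaho Street?), but it may be good enough.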
library(tidyverse)
df <- structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company", "10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub", "10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545", "11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133", "1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733", "1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007", "830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870")), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
df %>%
  mutate(State = str_extract(BreweryAddress, str_c(v, collapse = "|")))
#> BreweryName
#> 1 (512) Brewing Company
#> 2 0 Mile Brewing Company
#> 3 10 Barrel Brewing
#> 4 10 Barrel Brewing - Eastside Pub
#> 5 10 Barrel Brewing - Portland Pub
#> 6 10 Barrel Brewing Co.
#> BreweryAddress
#> 1 407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545
#> 2 11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133
#> 3 1501 E StSan Diego, California, 92101United States
#> 4 62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733
#> 5 1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007
#> 6 830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870
#> State
#> 1 Texas
#> 2 Pennsylvania
#> 3 <NA>
#> 4 Oregon
#> 5 Oregon
#> 6 Idaho
Created on 2018-09-25 by the reprex package (v0.2.0).
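The "Idaho Street" concern can be reduced by using the layout of these addresses, where the state sits between a comma and a five-digit ZIP code. The sketch below uses base R rather than stringr so that it runs without extra packages. The first address is hypothetical and was made up to trigger the problem, and the names `candidate`, `naive`, and `state` are illustrative only.

```r
v <- c("Texas", "Pennsylvania", "Oregon", "Oregon", "Idaho")

addresses <- c(
  "100 Idaho StAustin, Texas, 78745United States",       # hypothetical address
  "1501 E StSan Diego, California, 92101United States"   # from the question
)

# Plain alternation takes the leftmost state name, which here is the
# one in the street name.
naive <- regmatches(addresses[1],
                    regexpr(paste(unique(v), collapse = "|"), addresses[1]))

# Capture the text between a comma and ", <5-digit ZIP>" instead.
# sub() returns the input unchanged when the pattern does not match,
# so the result is checked against v afterwards.
candidate <- sub("^.*,\\s*([A-Za-z ]+),\\s*\\d{5}.*$", "\\1",
                 addresses, perl = TRUE)
state <- ifelse(candidate %in% v, candidate, NA_character_)

naive
state
```

This assumes every address follows the "City, State, ZIP" layout seen in the sample data; addresses in another format would need a different pattern.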
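Answer 2 (score: 0)
The stringr library makes this simple:
# Note: v has six entries here, one per row (with "Oregon" as a placeholder
# for the California row), because str_extract() pairs each address with the
# pattern at the same position.
v <- c("Texas","Pennsylvania","Oregon","Oregon","Oregon","Idaho")
library(stringr)
# The question's data is assumed to be saved as a data frame named demographics.
demographics$State <- str_extract(demographics$BreweryAddress, fixed(v, ignore_case = TRUE))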
Answer 3 (score: 0)
Using regmatches and gregexpr
Data:
df1 <-
structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company",
"10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub",
"10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545",
"11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133",
"1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733",
"1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007",
"830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
)), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
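Code:
# Join the state names into one alternation pattern.
v_mod <- paste0(v, collapse="|")
# gregexpr() finds every match per address; addresses with no match become NA.
df1$states <- sapply(
  regmatches(df1$BreweryAddress, gregexpr(v_mod, df1$BreweryAddress)),
  function(x) if (length(x) == 0) NA else x
)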
Result:
# BreweryName BreweryAddress states
#4 (512) Brewing Company 407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545 Texas
#6 0 Mile Brewing Company 11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133 Pennsylvania
#8 10 Barrel Brewing 1501 E StSan Diego, California, 92101United States <NA>
#10 10 Barrel Brewing - Eastside Pub 62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733 Oregon
#12 10 Barrel Brewing - Portland Pub 1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007 Oregon
#14 10 Barrel Brewing Co. 830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870 Idaho
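Because gregexpr returns every match, an address that mentions two of the listed states yields two values for that row, and sapply then returns a list rather than a character vector. A minimal sketch of one way to guard against that is to use vapply and keep only the last match, on the assumption that the state appears after the street and city. The first address below is hypothetical, and `hits` and `states` are illustrative names.

```r
v <- c("Texas", "Pennsylvania", "Oregon", "Oregon", "Idaho")

addresses <- c(
  "12 Oregon AveAustin, Texas, 78745United States",      # hypothetical: two state names
  "1501 E StSan Diego, California, 92101United States"   # from the question: no listed state
)

v_mod <- paste(unique(v), collapse = "|")
hits <- regmatches(addresses, gregexpr(v_mod, addresses))

# vapply() guarantees exactly one string per address: NA when nothing
# matched, otherwise the last match found in the address.
states <- vapply(
  hits,
  function(x) if (length(x) == 0) NA_character_ else x[length(x)],
  character(1)
)

lengths(hits)
states
```

Taking the last match is a heuristic, not a rule; it fails if a state name also appears after the real state, for example in a business name appended to the address.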