I have a data frame whose rows contain locations. The locations can come in a variety of formats, for example:
"New York Manhattan UpperEast"
"Upper East, Manhattan, New York"
"Manhattan, New York, Upper East"
"California, San Francisco, Knob Hill"
"San Francisco Knob Hill California"
I want to search for certain words (for example, all state names) and delete everything else. The output should be:
New York
New York
New York
California
California
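How should I do this in R?
Answer 0 (score: 3)
Build a regular expression that matches any state from the built-in state.name vector, then apply it with strapplyc from the gsubfn package, as follows: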
x <- c("New York Manhattan UpperEast",
"Upper East, Manhattan, New York",
"Manhattan, New York, Upper East",
"California, San Francisco, Knob Hill",
"San Francisco Knob Hill California")
library(gsubfn)
states <- paste(state.name, collapse = "|")
strapplyc(x, states, simplify = TRUE)
which gives:
[1] "New York" "New York" "New York" "California" "California"
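If adding the gsubfn dependency is not desirable, roughly the same extraction can be sketched in base R with regexpr and regmatches. This is only an illustrative alternative, not part of the answer above; the variable names m and result are made up for the example, and it assumes at most one state name is wanted per string (the first match).

```r
# Same sample input as in the answer above
x <- c("New York Manhattan UpperEast",
       "Upper East, Manhattan, New York",
       "Manhattan, New York, Upper East",
       "California, San Francisco, Knob Hill",
       "San Francisco Knob Hill California")

# One alternation over all built-in state names
states <- paste(state.name, collapse = "|")

# regexpr() locates the first match in each string (-1 means no match)
m <- regexpr(states, x)

# regmatches() returns only the matched pieces, so fill them into a
# vector of NAs to keep the result aligned with the input
result <- rep(NA_character_, length(x))
result[m != -1] <- regmatches(x, m)
result
```

Strings that contain no state name come back as NA instead of being dropped, which keeps the result the same length as the input and therefore safe to assign to a data frame column.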
Answer 1 (score: 1)
Suppose the data frame looks like this:
names <- c("New York Manhattan UpperEast",
"Upper East, Manhattan, New York",
"Manhattan, New York, Upper East",
"California, San Francisco, Knob Hill",
"San Francisco Knob Hill California")
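df <- data.frame(locations = names)
You could use gsub, from the grep family (this returns what is left of each string after the matches are removed):
df$locations <- gsub("Manhattan|San Francisco", "", df$locations)
But I now see that you are trying to keep the matched names, so this should hopefully work instead:
library(stringr)
# str_match() returns a matrix; column 1 holds the full match
df$locations <- str_match(df$locations, "Manhattan|San Francisco")[, 1]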
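Since the question asks for state names, not city names, the same stringr approach could be pointed at a pattern built from the built-in state.name vector. The following is a minimal sketch under a few assumptions: the new column name state and the helper variable state_pattern are invented for illustration, the \\b word boundaries are an added precaution that the answer does not use, and only the first state found in each string is kept.

```r
library(stringr)

df <- data.frame(
  locations = c("New York Manhattan UpperEast",
                "Upper East, Manhattan, New York",
                "Manhattan, New York, Upper East",
                "California, San Francisco, Knob Hill",
                "San Francisco Knob Hill California"),
  stringsAsFactors = FALSE
)

# Alternation over every state name, wrapped in word boundaries so that
# a state name is not matched inside a longer word (assumed precaution)
state_pattern <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# str_extract() returns the first match per string, or NA if none
df$state <- str_extract(df$locations, state_pattern)
df$state
```

Writing the result to a separate column leaves the original locations intact, which makes it easier to inspect rows where no state was found.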