Question

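I'm trying to match city/county names (ideally given with their state) to the corresponding state names, and then use left_join() to attach each state's three-digit phone area code as another column. My initial idea was to copy the city/county column, use sapply() and grep() to replace the values with state names, and then left_join() the result with the area-code column, but my code doesn't seem to work.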

library(dplyr)

location <- data.frame(location = c('Asortia, New York', 'Buffalo, New York', 'New York, New York',  'Alexandra, Virginia', 'Fairfax, Virginia', 'Baltimore, Maryland', 'Springfield, Maryland'), number = c(100, 200, 300, 400, 500, 600, 700))

state <- data.frame(state = c('New York', 'Virginia', 'Maryland'))

sapply(as.character(state$state), function(i) grep(i, location$location))

### doesn't work! ###
### my desired output would be ###

  location number
1 New York    100
2 New York    200
3 New York    300
4 Virginia    400
5 Virginia    500
6 Maryland    600
7 Maryland    700

That way I could use left_join to merge the output above with the three-digit area codes. For example,

df <- location
names(df)[1] <- 'state'
digit <- data.frame(state = c('New York', 'Virginia', 'Maryland'), digit = c(212, 703, 410))
   
new_df <- left_join(df, digit, by = 'state')

### the desired output ###

  location number digit
1 New York    100   212
2 New York    200   212
3 New York    300   212
4 Virginia    400   703
5 Virginia    500   703
6 Maryland    600   410
7 Maryland    700   410

I've looked at this and this thread, but didn't get enough clues. I hope someone can help me.

## Update

I found that a for loop with grepl also works, but it can be slow if you have a lot of data (the dataset I'm working with has 2 million observations).

for (i in state$state) { 
location$location[grepl(i, location$location)] <- i
}
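
One way to avoid looping over the states is to fold them into a single alternation pattern and make one vectorised pass with base R's sub(). This is only a sketch: it assumes the state names contain no regex metacharacters, and the new `state` and `digit` columns on `location` are added here purely for illustration. Rows that match no state are returned unchanged by sub().

```r
location <- data.frame(
  location = c('Asortia, New York', 'Buffalo, New York', 'New York, New York',
               'Alexandra, Virginia', 'Fairfax, Virginia',
               'Baltimore, Maryland', 'Springfield, Maryland'),
  number = c(100, 200, 300, 400, 500, 600, 700),
  stringsAsFactors = FALSE
)
state <- data.frame(state = c('New York', 'Virginia', 'Maryland'),
                    stringsAsFactors = FALSE)
digit <- data.frame(state = c('New York', 'Virginia', 'Maryland'),
                    digit = c(212, 703, 410),
                    stringsAsFactors = FALSE)

# Build "New York|Virginia|Maryland" once instead of one grepl() call per state.
pattern <- paste(state$state, collapse = "|")

# The leading greedy ".*" makes the capture group take the LAST state name
# in the string, so 'New York, New York' resolves to the state, not the city.
location$state <- sub(paste0("^.*(", pattern, ").*$"), "\\1", location$location)

# Base R lookup equivalent to a left join on 'state'.
location$digit <- digit$digit[match(location$state, digit$state)]
```

The match() lookup gives NA for any state missing from `digit`, which mirrors what left_join() would do.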

Answer 1

You can use str_match, map and unite:

library(tidyverse)

location$state <- map_df(state, ~str_match(location$location, .x)) %>% 
                  unite("state", na.rm=T) %>% 
                  pull()

left_join(location, digit, by = "state") %>% 
  select(state, number, digit)

     state number digit
1 New York    100   212
2 New York    200   212
3 New York    300   212
4 Virginia    400   703
5 Virginia    500   703
6 Maryland    600   410
7 Maryland    700   410

Answer 2

Perhaps we can use str_remove, pasting the 'state' column of the 'state' dataset together with str_c to build a regex pattern that matches anything before the state name (to be removed):

library(stringr)
library(dplyr)

location %>%
  mutate(location = str_remove(location, str_c(".*(?=(", str_c(state$state, collapse = "|"), "))")))
#  location number
#1 New York    100
#2 New York    200
#3 New York    300
#4 Virginia    400
#5 Virginia    500
#6 Maryland    600
#7 Maryland    700

Another option is to separate the column into two columns and then remove the first:

library(tidyr)

location %>%
  separate(location, into = c('unwanted', 'location'), sep=",\\s*") %>%
  select(-unwanted)

Or, if we have a specific pattern, remove the prefix part by matching one or more characters that are not a , from the start (^), followed by a , and zero or more spaces (\\s*), as the pattern in str_remove.
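
A minimal sketch of that prefix-removal pattern followed by the join, assuming every value has the form 'City, State' with exactly one comma. The regex "^[^,]+,\\s*" and the `result` name are one reading of the description above, not code taken from the answer.

```r
library(dplyr)
library(stringr)

location <- data.frame(
  location = c('Asortia, New York', 'Buffalo, New York', 'New York, New York',
               'Alexandra, Virginia', 'Fairfax, Virginia',
               'Baltimore, Maryland', 'Springfield, Maryland'),
  number = c(100, 200, 300, 400, 500, 600, 700),
  stringsAsFactors = FALSE
)
digit <- data.frame(state = c('New York', 'Virginia', 'Maryland'),
                    digit = c(212, 703, 410),
                    stringsAsFactors = FALSE)

result <- location %>%
  # Drop everything up to and including the first comma and any spaces after it.
  mutate(state = str_remove(location, "^[^,]+,\\s*")) %>%
  # Attach the area code for each state.
  left_join(digit, by = "state") %>%
  select(state, number, digit)
```

Unlike the lookahead version, this does not need the list of state names, but it will return the wrong value for any row with more than one comma.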
