I can't figure out how to accomplish this task
Consider a data frame "usa" with 3 columns, "title", "city" and "state" (reproducible):
title <- c("Events in Chicago, September", "California hotels",
           "Los Angeles, August", "Restaurant in Chicago")
city <- c("", "", "Los Angeles", "Chicago")
state <- c("", "", "California", "IL")
usa <- data.frame(title, city, state)
Which results in:
                         title        city      state
1 Events in Chicago, September
2            California hotels
3          Los Angeles, August Los Angeles California
4        Restaurant in Chicago     Chicago         IL
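What I want to do now is fill in the state variable for the first 2 observations, where it is currently missing.
The title variable contains the clues: every entry mentions either a city or a state.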
I need to do the following, so that I end up with:
                         title        city      state
1 Events in Chicago, September                     IL
2            California hotels             California
3          Los Angeles, August Los Angeles California
4        Restaurant in Chicago     Chicago         IL
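In other words, in the second row the title contains the word "California", so the matching state is found in the state vector. In the first row, however, the word "Chicago" is the key: another entry in the data frame (row 4) associates Chicago with the state "IL", so "IL" must be pasted into the first row of the "state" column.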
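The rule described above can be sketched in base R roughly as follows. This is only an illustration under some assumptions: fill_state is a hypothetical helper name, the rows that already have both a city and a state are treated as the only reference data, and matching is plain case-sensitive substring matching (so a short abbreviation could in principle match inside a longer word).

```r
# Hypothetical helper: fill empty `state` values using clues in `title`.
fill_state <- function(df) {
  # Work on character vectors so factor columns do not get in the way.
  title <- as.character(df$title)
  city  <- as.character(df$city)
  state <- as.character(df$state)

  # Rows with both values present act as the reference data.
  known <- city != "" & state != ""
  known_states <- unique(state[state != ""])

  # Named lookup vector: city name -> state (first occurrence wins).
  lookup <- state[known]
  names(lookup) <- city[known]
  lookup <- lookup[!duplicated(names(lookup))]

  for (i in which(state == "")) {
    # Step 1: does the title mention a known state directly?
    is_state <- vapply(known_states, grepl, logical(1),
                       x = title[i], fixed = TRUE)
    hit <- known_states[is_state]

    if (length(hit) == 0) {
      # Step 2: otherwise look for a known city and map it to its state.
      is_city <- vapply(names(lookup), grepl, logical(1),
                        x = title[i], fixed = TRUE)
      hit <- unname(lookup[names(lookup)[is_city]])
    }

    # Leave the value empty when neither step finds a clue.
    if (length(hit) > 0) state[i] <- hit[1]
  }

  df$state <- state
  df
}

usa <- data.frame(
  title = c("Events in Chicago, September", "California hotels",
            "Los Angeles, August", "Restaurant in Chicago"),
  city  = c("", "", "Los Angeles", "Chicago"),
  state = c("", "", "California", "IL"),
  stringsAsFactors = FALSE
)

result <- fill_state(usa)
print(result)
```

The city column is deliberately left untouched here, since the desired output only fills in the state.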
Looking forward to the community's ideas :) Thanks!
Answer 0 (score: 0)
I suggest you use the stringr package; in particular, a function called str_extract.
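If you have a complete list of cities, e.g. city <- c("Los Angeles", "Chicago"), you can turn it into a regular expression with paste(city, collapse = '|'). This gives you 'Los Angeles|Chicago'. With str_extract you can then extract the city from the title (it extracts the first city it sees, or NA if there is none). Here is the complete code. Note: this only works if your data frame is a data_frame (tibble), not a data.frame (not entirely sure why, I haven't looked into it).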
library(tidyverse)
library(stringr)
title <- c("Events in Chicago, September", "California hotels",
           "Los Angeles, August", "Restaurant in Chicago")
city <- c("", "", "Los Angeles", "Chicago")
state <- c("", "", "California", "IL")
usa <- data_frame(title, city, state) # notice this is a data_frame not data.frame
cities <- paste(c("Los Angeles", "Chicago"), collapse = '|')
states <- paste(c("California", "IL"), collapse = '|')
usa <- usa %>%
  mutate(city = ifelse(city == '', str_extract(title, cities), city),
         state = ifelse(state == '', str_extract(title, states), state))
This results in:
# A tibble: 4 x 3
  title                        city        state
  <chr>                        <chr>       <chr>
1 Events in Chicago, September Chicago     <NA>
2 California hotels            <NA>        California
3 Los Angeles, August          Los Angeles California
4 Restaurant in Chicago        Chicago     IL
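Row 1 still has no state in this result, because its title names only a city. One possible way to finish the job is to look the extracted city up among the rows where both values are known. The following is a sketch rather than a tested general solution: city_state and filled are names introduced here for illustration, tibble() is used in place of data_frame() (which newer tibble versions deprecate), and the city and state names are assumed to contain no regex metacharacters. The tibble-only caveat above is plausibly a consequence of data.frame converting strings to factors by default in R versions before 4.0, though that is an assumption and not something confirmed here.

```r
library(dplyr)
library(stringr)
library(tibble)

usa <- tibble(
  title = c("Events in Chicago, September", "California hotels",
            "Los Angeles, August", "Restaurant in Chicago"),
  city  = c("", "", "Los Angeles", "Chicago"),
  state = c("", "", "California", "IL")
)

# Lookup table built from the rows where both city and state are known.
city_state <- usa %>%
  filter(city != "", state != "") %>%
  distinct(city, .keep_all = TRUE)

# Alternation patterns such as "Los Angeles|Chicago".
cities <- paste(city_state$city, collapse = "|")
states <- paste(unique(city_state$state), collapse = "|")

filled <- usa %>%
  mutate(
    # Same extraction step as in the answer; NA when nothing matches.
    city  = if_else(city == "", str_extract(title, cities), city),
    state = if_else(state == "", str_extract(title, states), state),
    # Where the title named only a city, borrow the state from the lookup.
    state = coalesce(state, city_state$state[match(city, city_state$city)])
  )

print(filled)
```

Unlike the output requested in the question, this version also fills in the city for row 1; drop the first mutate line and match on str_extract(title, cities) directly if the city column should stay as it was.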