I have a data frame like this:
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"))
I want to apply the following rules:
stocks <- c("google, yahoo")
climate <- c("air")
cuisine <- c("cookie")
and get a result like this:
df_ne <- data.frame(id = c(1,2), keywords = c("stocks, climate, cuisine", "climate, cuisine"))
How can I do this?
Answer 0 (score: 3)
You can use str_replace_all from the stringr package:
library(dplyr)
library(stringr)
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"))
df %>%
  # names of the vector are the patterns, values are the replacements
  mutate(keywords = str_replace_all(keywords,
    c("google, yahoo" = "stocks", "air" = "climate", "cookie" = "cuisine")))
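Answer 1 (score: 2)
I like the answer above (+1), but you can also use tidytext::unnest_tokens(), which gets simpler once you have more than six or so words to map.
First, you can create a mapping df: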
mapped <- rbind(data.frame(word_a = stocks, type = "stock", stringsAsFactors = FALSE),
                data.frame(word_a = climate, type = "climate", stringsAsFactors = FALSE),
                data.frame(word_a = cuisine, type = "cuisine", stringsAsFactors = FALSE))
Now you can reach the goal by unnesting both data frames with that function and joining them:
library(tidytext)
library(stringr)
library(tidyverse)
mapped <- mapped %>% unnest_tokens(word, word_a)
df %>%
  unnest_tokens(word, keywords) %>%    # split keywords into one word per row
  left_join(mapped, by = "word") %>%   # look up the type of each word
  group_by(id) %>%                     # one group per original row
  summarise(keywords = str_c(unique(type), collapse = ","))  # collapse the unique types
# A tibble: 2 x 2
#      id keywords
#   <dbl> <chr>
# 1     1 stock,climate,cuisine
# 2     2 cuisine,climate
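Note that the second row lists its types in the same order as the corresponding words in the original df (cuisine, climate), not in the order shown in the expected output.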
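If the order in the expected output matters, one option is to sort the labels by a fixed category order before collapsing them. The following is a minimal base R sketch, not part of the answer above: the names lookup, label_order and relabel are hypothetical, and it assumes that words without a rule should simply be dropped.

```r
df <- data.frame(id = c(1, 2),
                 keywords = c("google, yahoo, air, cookie", "cookie, air"),
                 stringsAsFactors = FALSE)

# One entry per word: name = word, value = label (assumed mapping).
lookup <- c(google = "stocks", yahoo = "stocks", air = "climate", cookie = "cuisine")

# Assumed desired order of the labels in the output.
label_order <- c("stocks", "climate", "cuisine")

relabel <- function(x) {
  words  <- strsplit(x, ",\\s*")[[1]]           # split on commas
  labels <- unique(unname(lookup[words]))       # map words to labels, drop repeats
  labels <- labels[!is.na(labels)]              # drop words that have no rule
  labels <- labels[order(match(labels, label_order))]  # fixed category order
  paste(labels, collapse = ", ")
}

df$keywords <- vapply(df$keywords, relabel, character(1), USE.NAMES = FALSE)
df$keywords
```

Sorting with match() against label_order makes the result independent of the word order in each row.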
With data:
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"), stringsAsFactors = F)
stocks <- c("google, yahoo")
climate <- c("air")
cuisine <- c("cookie")
Answer 2 (score: 1)
Here is a naive solution to start from:
key <- list(
stocks = c("google", "yahoo"),
climate = "air",
cuisine = "cookie"
)
df2 <- df
# replace each word by the name of its key
for (k in seq_along(key)) {
  for (sk in key[[k]]) {
    df2$keywords <- gsub(sk, names(key)[k], df2$keywords, fixed = TRUE)
  }
}
# remove duplicated items and collapse back into one string per row
df2$keywords <- sapply(strsplit(df2$keywords, ", "), function(l) paste(unique(l), collapse = ", "))
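One caveat with the loop above: gsub with fixed = TRUE replaces substrings, so a rule for "air" would also rewrite part of a longer word such as "airline". The sketch below contrasts the literal replacement with a whole-word variant using \\b word boundaries. The input string x is a made-up example, and the whole-word variant assumes the keywords contain no regex metacharacters.

```r
key <- list(
  stocks  = c("google", "yahoo"),
  climate = "air",
  cuisine = "cookie"
)

x <- "airline, air, google"  # hypothetical input, not from the question

# Literal replacement: also rewrites the "air" inside "airline".
literal <- x
for (k in seq_along(key)) {
  for (sk in key[[k]]) {
    literal <- gsub(sk, names(key)[k], literal, fixed = TRUE)
  }
}

# Whole-word replacement: \\b anchors the pattern at word boundaries.
whole <- x
for (k in seq_along(key)) {
  for (sk in key[[k]]) {
    whole <- gsub(paste0("\\b", sk, "\\b"), names(key)[k], whole)
  }
}

literal
whole
```

With the question's data both variants give the same result; the difference only shows up when one keyword is contained in another word.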