grep for multiple patterns or conditions on grouped data

Time: 2018-08-31 08:03:57

Tags: r

I have data that is organized into groups:

df <- data.frame(group_id= c(1, 1, 1, 1, 2, 1, 2, 3, 4),
                words = c("beach", "sand", "trip", "warm","travel", "water","beach","sand", "trees"),
                 ID = c("vacation", "vacation", "vacation", "vacation", "meeting","vacation","meeting","onduty", "hiking"))

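The rows are grouped by the group_id and ID columns. Now, for each group, I want to check for certain patterns ("beach" or "warm" or "sand"), print the matched patterns in a separate column, and print 0 (no match) or 1 (match) in another column.

Expected: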

  id  words       ID           pattern Match
1  1  beach vacation Beach, sand, warm 1
2  1   sand vacation Beach, sand, warm 1
3  1   trip vacation Beach, sand, warm 1
4  1   warm vacation Beach, sand, warm 1
5  2 travel  meeting Beach             1
6  1  water vacation Beach, sand, warm 1
7  2  beach  meeting Beach             1
8  3   sand   onduty sand              1
9  4  trees  hiking  0                 0

4 answers:

Answer 0 (score: 1)

ids <- df$ID[ grepl("^(beach|warm|sand)$",df$words) ]

df[df$ID %in% ids,]

#  group_id  words       ID
#1        1  beach vacation
#2        1   sand vacation
#3        1   trip vacation
#4        1   warm vacation
#5        2 travel  meeting
#6        1  water vacation
#7        2  beach  meeting
#8        3   sand   onduty
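
The filter above keeps the matching groups but does not add the 0/1 column the question asks for. A minimal base R sketch of one way to do that, assuming the flag should be computed per group_id (which coincides with ID in this data); the column name Match is taken from the expected output, and the helper variable hit is hypothetical:

```r
df <- data.frame(
  group_id = c(1, 1, 1, 1, 2, 1, 2, 3, 4),
  words = c("beach", "sand", "trip", "warm", "travel", "water", "beach", "sand", "trees"),
  ID = c("vacation", "vacation", "vacation", "vacation", "meeting", "vacation", "meeting", "onduty", "hiking")
)

# TRUE for rows whose word is exactly one of the three patterns
hit <- grepl("^(beach|warm|sand)$", df$words)

# Spread the per-row hits over the whole group:
# every row of a group gets 1 if any row in that group matched, else 0
df$Match <- ave(as.integer(hit), df$group_id, FUN = max)

df
```

Only group 4 ("trees") ends up with Match equal to 0; all rows are kept rather than filtered out.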

Answer 1 (score: 1)

You could try the following. Find the unique group_id values associated with the key words, then subset df using []:

df[df$group_id %in% unique(df$group_id[df$words %in% c('beach', 'sand', 'warm')]),]

  group_id  words       ID
1        1  beach vacation
2        1   sand vacation
3        1   trip vacation
4        1   warm vacation
5        2 travel  meeting
6        1  water vacation
7        2  beach  meeting
8        3   sand   onduty
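
The same exact-match idea with %in% can also produce the per-group pattern column from the question. This is only a sketch under assumptions: patterns are compared in lower case, unmatched groups get an empty string instead of the 0 shown in the expected output, and the names patterns and found are hypothetical:

```r
df <- data.frame(
  group_id = c(1, 1, 1, 1, 2, 1, 2, 3, 4),
  words = c("beach", "sand", "trip", "warm", "travel", "water", "beach", "sand", "trees"),
  ID = c("vacation", "vacation", "vacation", "vacation", "meeting", "vacation", "meeting", "onduty", "hiking")
)

patterns <- c("beach", "sand", "warm")

# For each group, keep the patterns that occur among its words
# and join them into one comma-separated string
found <- tapply(
  as.character(df$words), df$group_id,
  function(w) paste(patterns[patterns %in% w], collapse = ", ")
)

# Look up each row's group result by name and attach it as a plain character column
df$pattern <- as.vector(found[as.character(df$group_id)])

df
```

Group 1 gets "beach, sand, warm", group 2 gets "beach", group 3 gets "sand", and group 4 gets "".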

Answer 2 (score: 1)

Using sqldf: first select the group_id values whose words are in ('beach','sand','warm'), then select all rows belonging to those group_id values.

library(sqldf)
sqldf("select * from df where group_id IN(select group_id from df where words IN ('beach','sand','warm'))")

Answer 3 (score: 1)

I used dplyr and grep to get the desired result. The code is below:

library(dplyr) 

pattern <- c("Beach", "sand", "warm")
df <- data.frame(group_id= c(1, 1, 1, 1, 2, 1, 2, 3, 4),
                 words = c("beach", "sand", "trip", "warm","travel", "water","beach","sand", "trees"),
                 ID = c("vacation", "vacation", "vacation", "vacation", "meeting","vacation","meeting","onduty", "hiking"))

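# Collapse each group's words into one space-separated string
x <- df %>%
  group_by(group_id) %>%
  summarise(words = paste(words, collapse = " "))
# For each pattern, collect the group_ids whose words contain it as a whole word.
# lapply always returns a list, and indexing x$group_id avoids relying on row positions.
y <- lapply(setNames(pattern, pattern), function(d) x$group_id[grep(paste0("\\b", d, "\\b"), x$words, ignore.case = TRUE)])
y <- setNames(unlist(y, use.names = FALSE), rep(names(y), lengths(y)))
y <- data.frame(Match_pattern = names(y), group_id = y, row.names = NULL)
# Join the matched patterns of each group into one comma-separated string
y <- y %>%
  group_by(group_id) %>%
  summarise(Match_pattern = paste(Match_pattern, collapse = ", "))

# Left join onto the original rows; groups without a match get NA
out <- merge(df, y, by = "group_id", all.x = TRUE)
out$N <- ifelse(is.na(out$Match_pattern), 0, 1)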

> out
  group_id  words       ID     Match_pattern N
1        1   sand vacation Beach, sand, warm 1
2        1   trip vacation Beach, sand, warm 1
3        1   warm vacation Beach, sand, warm 1
4        1  beach vacation Beach, sand, warm 1
5        1  water vacation Beach, sand, warm 1
6        2  beach  meeting             Beach 1
7        2 travel  meeting             Beach 1
8        3   sand   onduty              sand 1
9        4  trees   hiking              <NA> 0
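
A shorter variant of the same idea is possible with a grouped mutate, which avoids the intermediate tables and the merge. This is a sketch rather than a drop-in replacement: it assumes exact, lower-case matching with %in% instead of the word-boundary regex, leaves an empty string (not NA) for unmatched groups, and out2 is a hypothetical name:

```r
library(dplyr)

df <- data.frame(
  group_id = c(1, 1, 1, 1, 2, 1, 2, 3, 4),
  words = c("beach", "sand", "trip", "warm", "travel", "water", "beach", "sand", "trees"),
  ID = c("vacation", "vacation", "vacation", "vacation", "meeting", "vacation", "meeting", "onduty", "hiking")
)

pattern <- c("beach", "sand", "warm")

out2 <- df %>%
  group_by(group_id) %>%
  mutate(
    # patterns present among this group's words, comma-separated
    Match_pattern = paste(pattern[pattern %in% words], collapse = ", "),
    # 1 if any word in the group is one of the patterns, else 0
    N = as.integer(any(words %in% pattern))
  ) %>%
  ungroup()

out2
```

Because mutate keeps the original row order, the result lines up row for row with df, which the merge-based version does not guarantee.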