Label/categorize a column of strings using multiple matching patterns

Time: 2015-11-12 13:28:37

Tags: regex r string-matching sapply grepl

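I have a data frame with a column of strings that needs to be categorized against a second data frame, which holds the category labels in one column and the matching terms/patterns in another.

There are 50+ categories, and each string can match multiple categories, while other strings have no matches at all. How can I efficiently label these strings with the category labels?

Below is a simple example dataset and the output I am hoping to get. If it makes any difference, the strings in the real dataset are much longer than these sample strings, and there are dozens of them.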

recipes <- c('fresh asparagus', 'a bunch of bananas', 'one pound pork', 'no fruits, no veggies, no nothing', 'broccoli or spinach','I like apples, asparagus, and pork', 'meats like lamb', 'venison sausage and fried eggs', 'spinach and arugula salad', 'scrambled or poached eggs', 'sourdough english muffins')
recipes_df <- data.frame(recipes, stringsAsFactors = FALSE)

category <- c('vegetable', 'fruit', 'meat','bread','dairy')
items <- c('arugula|asparagus|broccoli|peas|spinach', 'apples|bananas|blueberries|oranges', 'lamb|pork|turkey|venison', 'sourdough', 'buttermilk|butter|cream|eggs')
category_df <- data.frame(category, items)

This is the output I am hoping to get:

                          recipes            recipes_category
1                     fresh asparagus              vegetable
2                  a bunch of bananas                  fruit
3                      one pound pork                   meat
4   no fruits, no veggies, no nothing                   <NA>
5                 broccoli or spinach              vegetable
6  I like apples, asparagus, and pork fruit, vegetable, meat
7                     meats like lamb                   meat
8      venison sausage and fried eggs            meat, dairy
9           spinach and arugula salad              vegetable
10          scrambled or poached eggs                  dairy
11          sourdough english muffins                  bread

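I believe some combination of grepl and a for loop or one of the apply functions is needed, but the examples I have tried below really expose how little R I know. For example, using sapply gives me the result I expect, sapply(category_df$items, grepl, recipes_df$recipes), but I'm not sure how to turn that result into the simple column I need.

If I use the categorize function found here, it only matches one category per string:

categorize_food <- function(df, searchString, category) {
  df$category <- "OTHER"
  for (i in seq_along(searchString)) {
    # rows whose text matches the i-th pattern
    hits <- grep(searchString[i], df[, 1], ignore.case = TRUE)
    if (length(hits) > 0) {
      # overwrites any label assigned by an earlier pattern
      df$category[hits] <- category[i]
    }
  }
  df
}
recipes_cat <- categorize_food(recipes_df, category_df$items, category_df$category)

Similarly, the function found here comes closest to what I'm looking for, but I don't understand why the category numbers map the way they do. I expected the vegetable category to be 1 rather than 2, and dairy to be 5 rather than 3.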
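
On the unexpected numbers (vegetable as 2, dairy as 3): the function that produced them is not shown here, so the following is only a likely explanation. It assumes the code ran on an R version before 4.0.0, where data.frame() turned character columns into factors unless stringsAsFactors = FALSE was given. category_df was built without that argument, so items would be a factor whose integer codes follow the alphabetical order of the pattern strings rather than the row order:

```r
category <- c('vegetable', 'fruit', 'meat', 'bread', 'dairy')
items <- c('arugula|asparagus|broccoli|peas|spinach',
           'apples|bananas|blueberries|oranges',
           'lamb|pork|turkey|venison',
           'sourdough',
           'buttermilk|butter|cream|eggs')

# Assumption: mimic what data.frame() did by default before R 4.0.0
items_factor <- factor(items)

# Factor levels are sorted alphabetically: apples..., arugula..., buttermilk..., lamb..., sourdough
codes <- as.integer(items_factor)

# Pair each code with its category label, in the original row order
print(setNames(codes, category))
# vegetable = 2, fruit = 1, meat = 4, bread = 5, dairy = 3
```

If that is the cause, building category_df with stringsAsFactors = FALSE, as was done for recipes_df, keeps both columns as plain character vectors.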

2 Answers:

Answer 0 (score: 1)

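The aggregate step near the end is a bit slow for large datasets, so there may be a faster way (data.table?) to turn the rows into strings, but this should work in general: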

# Split each '|'-separated pattern string into its individual search terms
tmplist <- strsplit(items, "|", fixed=TRUE)

# Rebuild the lookup as a tidy data frame: one row per (category, search term)
searchterms <- data.frame(category=rep(category, sapply(tmplist, length)),
                          items=unlist(tmplist), stringsAsFactors=FALSE)

# For every search term, collect the recipes that contain it
res <- lapply(searchterms$items, grep, x=recipes, value=TRUE)

# Repeat each category name once per matching recipe and pair it with
# the matched recipe names to get a tidy result
matched_times <- sapply(res, length)
df_matched <- data.frame(category = rep(searchterms$category[matched_times != 0],
                                        matched_times[matched_times != 0]),
                         recipes = unlist(res))

# Add the recipes that matched nothing, with NA as their category
df_unmatched <- data.frame(category = NA, recipes = recipes[!recipes %in% unlist(res)])
df <- rbind(df_matched, df_unmatched)

# Collapse back to one row per recipe (untidy, as you asked)
final <- aggregate(category~recipes, data=df, paste, sep=",", na.action=na.pass)

But this still leaves duplicate entries such as vegetable, vegetable. We can't have that:

SplitFunction <- function(x) {
  # split on commas and keep each category only once
  parts <- unlist(strsplit(x, ','))
  paste(unique(parts), collapse=", ")
}
SplitFunctionV <- Vectorize(SplitFunction)
final$category <- SplitFunctionV(final$category)
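
As a possible shortcut, the de-duplication can be folded into the aggregate call itself by passing a function that applies unique() before collapsing. The small data frame below is made-up illustration data shaped like df above, not the real intermediate result, and collapse_unique is a hypothetical helper name:

```r
# Illustration data: one row per (category, recipe) match, NA for no match
df_demo <- data.frame(
  category = c("vegetable", "vegetable", "fruit", "meat", "dairy", NA),
  recipes  = c("broccoli or spinach", "broccoli or spinach",
               "a bunch of bananas", "venison sausage and fried eggs",
               "venison sausage and fried eggs", "no fruits, no veggies, no nothing"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop repeated labels, keep a real NA for unmatched recipes
collapse_unique <- function(x) {
  if (all(is.na(x))) NA_character_ else paste(unique(x), collapse = ", ")
}

# na.pass keeps the rows whose category is NA instead of dropping them
final_demo <- aggregate(category ~ recipes, data = df_demo,
                        FUN = collapse_unique, na.action = na.pass)
print(final_demo)
```

This gives one string per recipe in a single pass, so the separate split-and-rejoin step is not needed in this variant.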

Result:

final
                              recipes               category
1                  a bunch of bananas                  fruit
2                 broccoli or spinach              vegetable
3                     fresh asparagus              vegetable
4  I like apples, asparagus, and pork vegetable, fruit, meat
5                     meats like lamb                   meat
6                      one pound pork                   meat
7           scrambled or poached eggs                  dairy
8           sourdough english muffins                  bread
9           spinach and arugula salad              vegetable
10     venison sausage and fried eggs            meat, dairy
11  no fruits, no veggies, no nothing                     NA
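
For comparison, here is a more compact base R sketch that skips the aggregate step by starting from the sapply/grepl call in the question and collapsing the resulting logical matrix row by row. It is one possible approach rather than the answer's method, and the names hits and label_row are illustrative:

```r
recipes <- c('fresh asparagus', 'a bunch of bananas', 'one pound pork',
             'no fruits, no veggies, no nothing', 'broccoli or spinach',
             'I like apples, asparagus, and pork', 'meats like lamb',
             'venison sausage and fried eggs', 'spinach and arugula salad',
             'scrambled or poached eggs', 'sourdough english muffins')
category <- c('vegetable', 'fruit', 'meat', 'bread', 'dairy')
items <- c('arugula|asparagus|broccoli|peas|spinach',
           'apples|bananas|blueberries|oranges',
           'lamb|pork|turkey|venison',
           'sourdough',
           'buttermilk|butter|cream|eggs')

# Logical matrix: one row per recipe, one column per category pattern
hits <- sapply(items, grepl, x = recipes)
colnames(hits) <- category

# Illustrative helper: join the labels of all matching columns, or NA if none match
label_row <- function(row) {
  if (any(row)) paste(category[row], collapse = ", ") else NA_character_
}

result <- data.frame(recipes,
                     recipes_category = apply(hits, 1, label_row),
                     stringsAsFactors = FALSE)
print(result)
```

Labels come out in the order of the category vector, so a recipe matching several categories is labelled vegetable, fruit, meat rather than in order of appearance in the string. The original row order of recipes is preserved.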

Answer 1 (score: 1)

Here is a very simple tidyverse option:

library(tidyverse)

# reformat the category data frame so each item has its own row:
category_df <-
  category_df %>%
  mutate(items = str_split(items, "\\|")) %>%
  unnest(items)

# then use str_extract_all() to find every item in each recipe string:
recipes_df %>%
  mutate(recipe_category = str_extract_all(recipes, paste(category_df$items, collapse = '|')))
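
Note that str_extract_all() returns the matched items (for example asparagus) rather than the category labels the question asks for. One way to finish the job, sketched here under the assumption that dplyr, tidyr and stringr are installed, is to look each matched item up in the long-format table. The names lookup, pattern, matched and to_labels are illustrative and not part of the original answer:

```r
library(dplyr)
library(tidyr)
library(stringr)

recipes_df <- data.frame(
  recipes = c('fresh asparagus', 'a bunch of bananas', 'one pound pork',
              'no fruits, no veggies, no nothing', 'broccoli or spinach',
              'I like apples, asparagus, and pork', 'meats like lamb',
              'venison sausage and fried eggs', 'spinach and arugula salad',
              'scrambled or poached eggs', 'sourdough english muffins'),
  stringsAsFactors = FALSE)
category_df <- data.frame(
  category = c('vegetable', 'fruit', 'meat', 'bread', 'dairy'),
  items = c('arugula|asparagus|broccoli|peas|spinach',
            'apples|bananas|blueberries|oranges',
            'lamb|pork|turkey|venison', 'sourdough',
            'buttermilk|butter|cream|eggs'),
  stringsAsFactors = FALSE)

# Long format: one row per (category, item)
lookup <- category_df %>%
  mutate(items = str_split(items, fixed("|"))) %>%
  unnest(items)

# Single alternation pattern covering every item
pattern <- paste(lookup$items, collapse = "|")

# Illustrative helper: map matched items to unique category labels, NA if nothing matched
to_labels <- function(m) {
  cats <- unique(lookup$category[match(m, lookup$items)])
  if (length(cats) == 0) NA_character_ else paste(cats, collapse = ", ")
}

labelled <- recipes_df %>%
  mutate(matched = str_extract_all(recipes, pattern),
         recipe_category = vapply(matched, to_labels, character(1)))
print(labelled[, c("recipes", "recipe_category")])
```

With this variant the labels follow the order in which the items appear in each recipe string, so the sixth recipe comes out as fruit, vegetable, meat.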