Question

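I have a data frame df in which the words are separated by +, but I don't want their order to matter when I run my analysis. For example, I have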

df <- as.data.frame(
      c(("Yellow + Blue + Green"),
        ("Blue + Yellow + Green"),
        ("Green + Yellow + Blue")))

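Currently these are treated as three unique responses, but I want them to be considered the same. I have tried brute-force approaches such as ifelse statements, but they don't scale to large data sets.

Is there a way to rearrange the terms so that they match, or something like a reverse combn function that can recognize them as the same combination in a different order?

Thanks!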

Answer 1

#DATA
df <- data.frame(cols = 
                 c(("Yellow + Blue + Green"),
                   ("Blue + Yellow + Green"),
                   ("Green + Yellow + Blue"),
                   ("Green + Yellow + Red")), stringsAsFactors = FALSE)

#Split, sort, and then paste together
df$group = sapply(df$cols, function(a)
    paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))
df
#                   cols               group
#1 Yellow + Blue + Green Blue, Green, Yellow
#2 Blue + Yellow + Green Blue, Green, Yellow
#3 Green + Yellow + Blue Blue, Green, Yellow
#4  Green + Yellow + Red  Green, Red, Yellow

#Or you can convert to factors too (and back to numeric, if you like)
df$group2 = as.numeric(as.factor(sapply(df$cols, function(a)
        paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))))
df
#                   cols               group group2
#1 Yellow + Blue + Green Blue, Green, Yellow      1
#2 Blue + Yellow + Green Blue, Green, Yellow      1
#3 Green + Yellow + Blue Blue, Green, Yellow      1
#4  Green + Yellow + Red  Green, Red, Yellow      2
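Once each response has an order-independent key, counting identical combinations takes one more step. The following is a minimal sketch in base R, not part of the answer above: canonical_key is a hypothetical helper name, and it assumes the separator is always exactly " + ".

```r
# Hypothetical helper (an assumption, not from the original answer):
# split on the literal " + ", sort the pieces, and glue them back together.
canonical_key <- function(x, sep = " + ") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(sort(trimws(p)), collapse = " + "),
         character(1))
}

responses <- c("Yellow + Blue + Green",
               "Blue + Yellow + Green",
               "Green + Yellow + Blue",
               "Green + Yellow + Red")

keys <- canonical_key(responses)

# Tabulate how often each combination occurs, regardless of word order
counts <- table(keys)
print(counts)
```

Using fixed = TRUE avoids having to escape the + as a regular expression, which is the reason the answer above writes " \\+ ".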

Answer 2

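I'd like to offer my take on this, since it isn't clear what format you want the output in:

I used the packages stringr and iterators, along with the df that d.b. created.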

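library(stringr)
library(iterators)

search <- c("Yellow", "Green", "Blue")
# Extract the words of each response, then test whether all search terms are present
L <- str_extract_all(df$cols, boundary("word"))
sapply(iter(L), function(x) all(search %in% x))
# [1]  TRUE  TRUE  TRUE FALSE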
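Note that the %in% check tests whether every search term is present, so a response containing extra words would also return TRUE. If an exact match of the whole combination is needed, one possible variation (a sketch using only base R, with an extra made-up fifth response added for illustration) is to compare the sets with setequal:

```r
search <- c("Yellow", "Green", "Blue")

# The fifth response is a hypothetical example added to show the difference
responses <- c("Yellow + Blue + Green",
               "Blue + Yellow + Green",
               "Green + Yellow + Blue",
               "Green + Yellow + Red",
               "Yellow + Blue + Green + Red")

# Split each response on the literal " + "
words <- strsplit(responses, " + ", fixed = TRUE)

# TRUE only when a response holds exactly the search terms, in any order
same_set <- vapply(words, function(w) setequal(w, search), logical(1))

# Containment check for comparison: TRUE when all search terms are present
contains_all <- vapply(words, function(w) all(search %in% w), logical(1))
```

Here same_set and contains_all differ only for the fifth response, which contains all three search terms plus Red.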

How to match strings made of the same words in different orders in R