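I have a data frame df in which words are separated by +, but I do not want the order of the words to matter when I run my analysis. For example, I have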
df <- as.data.frame(
c(("Yellow + Blue + Green"),
("Blue + Yellow + Green"),
("Green + Yellow + Blue")))
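Currently these are treated as three unique responses, but I want them to be treated as the same. I have tried brute-force approaches such as ifelse statements, but they do not scale to large datasets.
Is there a way to rearrange the terms so that they match, or something like a reverse combn function that can recognize that they are the same combination in a different order?
Thanks!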
Answer 0 (score: 6)
#DATA
df <- data.frame(cols =
c(("Yellow + Blue + Green"),
("Blue + Yellow + Green"),
("Green + Yellow + Blue"),
("Green + Yellow + Red")), stringsAsFactors = FALSE)
#Split, sort, and then paste together
df$group = sapply(df$cols, function(a)
paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))
df
#                   cols               group
#1 Yellow + Blue + Green Blue, Green, Yellow
#2 Blue + Yellow + Green Blue, Green, Yellow
#3 Green + Yellow + Blue Blue, Green, Yellow
#4  Green + Yellow + Red  Green, Red, Yellow
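The split, sort, and paste step above assumes that every separator is exactly " + ". As a minimal sketch, assuming the real data may have irregular spacing around the plus signs, the same idea can be wrapped in a small helper. The name normalize_combo and the sample vector combos are hypothetical and not part of the original answer.

```r
# Hypothetical helper: build an order-independent key for one "A + B + C" string.
normalize_combo <- function(x, sep = "+") {
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]  # split on the literal separator
  parts <- trimws(parts)                        # drop stray spaces around each term
  paste(sort(parts), collapse = ", ")           # sort so that order no longer matters
}

# Illustrative input with inconsistent spacing (an assumption about the data).
combos <- c("Yellow + Blue + Green", "Blue+Yellow +  Green", "Green + Yellow + Red")
keys <- vapply(combos, normalize_combo, character(1), USE.NAMES = FALSE)
keys
```

Using fixed = TRUE avoids having to escape the plus sign as a regular expression, and vapply() guarantees one character key per input row.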
#Or you can convert to factors too (and back to numeric, if you like)
df$group2 = as.numeric(as.factor(sapply(df$cols, function(a)
paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))))
df
#                   cols               group group2
#1 Yellow + Blue + Green Blue, Green, Yellow      1
#2 Blue + Yellow + Green Blue, Green, Yellow      1
#3 Green + Yellow + Blue Blue, Green, Yellow      1
#4  Green + Yellow + Red  Green, Red, Yellow      2
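Once each row has a canonical key, numbering or counting the combinations needs only base R. The sketch below assumes a key vector like the group column above; the values and variable names are illustrative. One point worth knowing is that as.factor() sorts its levels, so the resulting IDs follow alphabetical order, whereas match() against unique() numbers the groups in order of first appearance.

```r
# Canonical keys as produced by a split-sort-paste step (illustrative values).
group <- c("Green, Red, Yellow",
           "Blue, Green, Yellow",
           "Blue, Green, Yellow",
           "Blue, Green, Yellow")

# Number the groups in order of first appearance.
id_first_seen <- match(group, unique(group))     # 1 2 2 2

# as.factor() sorts the levels, so the numbering is alphabetical instead.
id_alphabetical <- as.integer(as.factor(group))  # 2 1 1 1

# Tally how often each combination occurs.
counts <- table(group)
counts
```

Either numbering identifies the same groups; the choice only matters if the ID values themselves are reported.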
Answer 1 (score: 0)
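I would like to offer my take on this, since it is not clear what output format you want.
I used the packages stringr and iterators, with the df from d.b.'s answer.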
library(stringr)    # str_extract_all(), boundary()
library(iterators)  # iter()
search <- c("Yellow", "Green", "Blue")
L <- str_extract_all(df$cols, boundary("word"))
sapply(iter(L), function(x) all(search %in% x))
# [1] TRUE TRUE TRUE FALSE
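Note that all(search %in% x) is TRUE whenever every search term is present, so a row containing an extra colour would also pass. If an exact match on the set of words is wanted, base R's setequal() is one option. A minimal sketch, using illustrative data and strsplit() in place of stringr:

```r
search <- c("Yellow", "Green", "Blue")
cols <- c("Yellow + Blue + Green",
          "Green + Yellow + Red",
          "Yellow + Blue + Green + Red")

# Split each string into its words on the literal " + " separator.
words <- strsplit(cols, " + ", fixed = TRUE)

# Subset test: are all search terms present?
contains_all <- vapply(words, function(x) all(search %in% x), logical(1))
# Exact test: the same set of terms, in any order?
exact_match <- vapply(words, function(x) setequal(x, search), logical(1))

contains_all  # TRUE FALSE TRUE
exact_match   # TRUE FALSE FALSE
```

The third row shows the difference: it contains all three search terms but also Red, so only the subset test accepts it.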