I should say up front that I'm not set on using distinct() to solve my problem; I'm open to any suggestion that works. Here is the puzzle:
Date <- c(1,1,2,2)
Group <- c("A","A","B","B")
Result <- c("Aa","Ab","Aa","SB")
df <- cbind(Date, Group, Result)
df
Date Group Result
[1,] "1" "A" "Aa"
[2,] "1" "A" "Ab"
[3,] "2" "B" "Aa"
[4,] "2" "B" "SB"
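The result I'm aiming for has one row per distinct Date: pick either of the rows containing Aa or Ab (a subset), but prefer any row containing SB over Aa, Ab, Ac, or ... . I've had a lot of trouble doing this efficiently for a large data frame, and I don't have a decent attempt to show here.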
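In reality, Groups A and B have many more time-based observations, and there are many more distinct groups. When data is uploaded twice (or more) for the same Group on a particular Date, there should really be only one entry for that Date: the one whose Result is more important.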
Update:
Expected subset of the output above after filtering, either of the following:
Date Group Result
[1,] "1" "A" "Aa"
[2,] "2" "B" "SB"
OR
Date Group Result
[1,] "1" "A" "Ab"
[2,] "2" "B" "SB"
Answer 0: (score: 0)
Using dplyr, but not distinct:
library(dplyr)
Date <- c(1,1,2,2)
Group <- c("A","A","B","B")
Result <- c("Aa","Ab","Aa","SB")
# Use data.frame rather than cbind, because cbind produces a character matrix
df <- data.frame(Date, Group, Result)
# To get your first answer
summarise(group_by(df, Date, Group),
Result = first(Result))
# To get your second answer
summarise(group_by(df, Date, Group),
Result = last(Result))
# To combine all the options
summarise(group_by(df, Date, Group),
Result = paste(Result, collapse = ", "))
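Answer 1: (score: 0)
The unique Result values need to be ranked by importance. This can be done manually or with some algorithm; both approaches are shown below. The ranking is then used to find the highest-ranked Result for each Date and Group combination. The code might look like this: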
library(dplyr)
df <- data.frame(df)
#
# manually list unique Results in order of increasing importance
#
Result_rank <- c("Aa","Ab","SB")
#
# Or use an algorithm to rank unique Results in order of importance;
# For the example, the algorithm might be:
#
Result_rank <- c(grep("^A",unique(df$Result), value=TRUE),
grep("SB",unique(df$Result), value=TRUE))
#
# summarize by highest ranked Result for each Date and Group
#
df_important <- df %>% group_by( Date, Group) %>%
summarize(Result= Result_rank[max(match(Result, Result_rank))])
which gives the result
Date Group Result
<fctr> <fctr> <chr>
1 1 A Ab
2 2 B SB
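For comparison, the same ranking idea can be sketched in base R without dplyr. This is only a minimal sketch: the importance order in `result_rank` and the helper column `rank` are assumptions made for illustration, and the real data would need its own ordering. It keeps whole rows rather than summarising, which may be convenient if the data frame has additional columns.

```r
# Example data from the question, built as a data frame
df <- data.frame(
  Date   = c(1, 1, 2, 2),
  Group  = c("A", "A", "B", "B"),
  Result = c("Aa", "Ab", "Aa", "SB"),
  stringsAsFactors = FALSE
)

# Assumed importance order, from least to most important
result_rank <- c("Aa", "Ab", "SB")

# Numeric rank of each row's Result (NA if it is not listed in result_rank)
df$rank <- match(df$Result, result_rank)

# Sort so that the most important Result comes first within each Date/Group;
# rows with an NA rank are placed last by order()
df_sorted <- df[order(df$Date, df$Group, -df$rank), ]

# Keep only the first row of each Date/Group pair
keep <- !duplicated(df_sorted[c("Date", "Group")])
df_important <- df_sorted[keep, c("Date", "Group", "Result")]
rownames(df_important) <- NULL
df_important
```

With the example data this keeps the Ab row for Date 1 and the SB row for Date 2, matching the second expected output in the question.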