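Finding overlapping elements in a data frame

Time: 2018-04-25 09:36:04

Tags: r dataframe

My data frame 'df' has the following structure:

Suppose there are 4 different stores and 4 different titles: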

Title Store
T1    S1
T1    S2
T1    S3
T1    S4
T2    S1
T2    S2
T2    S4
T3    S1
T3    S4
T4    S1
T4    S2

Question:

I want to find the stores that are common to every combination of titles.

Expected output:

Title_combination     Common_Store      
T1,T2,T3,T4           S1     
T1,T2,T3              S1,S4
T1,T2,T4              S1,S2
...                   ... and so on

1 answer:

Answer 0 (score: 1):

Using base R functions. Explanations are inline.

Data:

tbl <- read.table(text="Title Store
T1    S1
T1    S2
T1    S3
T1    S4
T2    S1
T2    S2
T2    S4
T3    S1
T3    S4
T4    S1
T4    S2", header=TRUE)

Code:

# get the unique titles
titles <- unique(tbl$Title)

# bind the per-combination rows into a single data.frame
do.call(rbind, unlist(
    # for each combination size n = 1, ..., number of titles
    lapply(seq_along(titles), function(n)
        # combn generates every combination of n titles and applies the function to each one
        combn(titles, n, function(subtitles) {
            # start with the stores of the first title, then successively intersect
            # with the stores of each remaining title in subtitles
            cstores <- Reduce(function(s, t2) intersect(s, tbl$Store[tbl$Title==t2]),
                subtitles[-1],
                tbl$Store[tbl$Title==subtitles[1]])
            data.frame(
                Title_combi=paste(subtitles, collapse=","),
                Common_Store=paste(cstores, collapse=",")
            )
        }, simplify=FALSE) # don't simplify the results from combn; keep them as a list
    ),
    recursive=FALSE)) # flatten only one level, so the data.frames themselves stay intact

Result:

#    Title_combi Common_Store
# 1           T1  S1,S2,S3,S4
# 2           T2     S1,S2,S4
# 3           T3        S1,S4
# 4           T4        S1,S2
# 5        T1,T2     S1,S2,S4
# 6        T1,T3        S1,S4
# 7        T1,T4        S1,S2
# 8        T2,T3        S1,S4
# 9        T2,T4        S1,S2
# 10       T3,T4           S1
# 11    T1,T2,T3        S1,S4
# 12    T1,T2,T4        S1,S2
# 13    T1,T3,T4           S1
# 14    T2,T3,T4           S1
# 15 T1,T2,T3,T4           S1
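
For readers who find the nested call hard to follow, the same idea can be split into named steps. The following is only a sketch under a few assumptions: it reuses the sample data from this answer, reads the columns as character via stringsAsFactors=FALSE, and introduces the names stores_by_title, common_stores, combos and res purely for illustration (none of them are base R functions or part of the original answer). It groups the stores per title once with split() and then intersects the relevant groups with Reduce(intersect, ...).

```r
# Sample data from the answer, read as character columns
tbl <- read.table(text="Title Store
T1    S1
T1    S2
T1    S3
T1    S4
T2    S1
T2    S2
T2    S4
T3    S1
T3    S4
T4    S1
T4    S2", header=TRUE, stringsAsFactors=FALSE)

# named list: one character vector of stores per title
stores_by_title <- split(tbl$Store, tbl$Title)
titles <- names(stores_by_title)

# hypothetical helper: stores shared by all titles in 'subtitles'
common_stores <- function(subtitles) {
    Reduce(intersect, stores_by_title[subtitles])
}

# every combination of 1, 2, ..., length(titles) titles, as a flat list
combos <- unlist(
    lapply(seq_along(titles), function(n) combn(titles, n, simplify=FALSE)),
    recursive=FALSE)

# one row per combination
res <- data.frame(
    Title_combi=vapply(combos, paste, character(1), collapse=","),
    Common_Store=vapply(combos,
        function(s) paste(common_stores(s), collapse=","),
        character(1)),
    stringsAsFactors=FALSE
)
res
```

Because the stores are grouped only once, the logical subsetting tbl$Store[tbl$Title==t2] is not repeated for every combination. Note that the number of combinations is 2^k - 1 for k titles, so either version is only practical for a modest number of titles.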