Count the common elements of a shared column across two data frames, grouped by the column unique to each?

Time: 2017-10-08 00:17:49

Tags: r

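I have two data frames that have one column in common. I am trying to get the count of shared elements, grouped by the other columns. Below are my input and expected output.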

DF1

    America AmericanSports
1 American1         Soccer
2 American1          Chess
3 American1         Tennis
4 American2          Chess
5 American2         Tennis
6 American3         Soccer
7 American3         Tennis

DF2

    Russia RussianSports
1 Russian1        Soccer
2 Russian1        Tennis
3 Russian2         Chess
4 Russian2        Tennis
5 Russian3         Chess
6 Russian3        Tennis
7 Russian3        Soccer

Output

     ANames   RNames Common_Sports_Count
1 American1 Russian1                   2
2 American1 Russian2                   2
3 American1 Russian3                   2
4 American2 Russian1                   1
5 American2 Russian2                   2
6 American2 Russian3                   3
7 American3 Russian1                   2
8 American3 Russian2                   1
9 American3 Russian3                   2

I tried an inner join of the input data frames, but couldn't get the logic right. I know this is simple, but I can't work out the correct logic.

2 answers:

Answer 0: (score: 2)

Simply merge on the sport columns, then aggregate:

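# Inner-join on the sport columns: one row per sport shared by each pair
mdf <- merge(df1, df2, by.x="AmericanSports", by.y="RussianSports")
# Count the joined rows for each America/Russia pair
aggdf <- aggregate(.~America+Russia, mdf, FUN=length)

# Rename the count column (the last column)
names(aggdf)[ncol(aggdf)] <- "Common_Sports_Count"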

aggdf
#     America   Russia Common_Sports_Count
# 1 American1 Russian1                   2
# 2 American2 Russian1                   1
# 3 American3 Russian1                   2
# 4 American1 Russian2                   2
# 5 American2 Russian2                   2
# 6 American3 Russian2                   1
# 7 American1 Russian3                   3
# 8 American2 Russian3                   2
# 9 American3 Russian3                   2
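One caveat: merge performs an inner join, so a pair with no sport in common produces no rows and is missing from the aggregated result rather than appearing with a count of 0. If zero counts matter, a base R cross-tabulation is one possible alternative. The following is only a sketch, not part of the original answer: it assumes each person/sport row is unique, and the extra "American4"/"Golf" row is hypothetical, added purely to show a zero count.

```r
# Input data from the question
df1 <- data.frame(
  America = c("American1", "American1", "American1", "American2",
              "American2", "American3", "American3"),
  AmericanSports = c("Soccer", "Chess", "Tennis", "Chess",
                     "Tennis", "Soccer", "Tennis"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Russia = c("Russian1", "Russian1", "Russian2", "Russian2",
             "Russian3", "Russian3", "Russian3"),
  RussianSports = c("Soccer", "Tennis", "Chess", "Tennis",
                    "Chess", "Tennis", "Soccer"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL extra row (not in the question): shares no sport with anyone
df1x <- rbind(df1, data.frame(America = "American4", AmericanSports = "Golf",
                              stringsAsFactors = FALSE))

# Use one common set of sport levels so both tables have identical rows
sports <- union(df1x$AmericanSports, df2$RussianSports)
a <- unclass(table(factor(df1x$AmericanSports, levels = sports), df1x$America))
r <- unclass(table(factor(df2$RussianSports, levels = sports), df2$Russia))

# t(a) %*% r: entry [i, j] is the number of sports shared by pair (i, j)
counts <- crossprod(a, r)

# Reshape the matrix to long format, one row per pair (zeros included)
res <- as.data.frame(as.table(counts), stringsAsFactors = FALSE)
names(res) <- c("America", "Russia", "Common_Sports_Count")
res
```

Every America/Russia combination appears in res, including the hypothetical American4 rows with a count of 0.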

To speed up the aggregation on large data sets, consider the data.table package:

library(data.table)

dt <- data.table(mdf)
aggdt <- dt[, list(Common_Sports_Count=length(AmericanSports)), by='America,Russia']
aggdt

#      America   Russia Common_Sports_Count
# 1: American1 Russian2                   2
# 2: American1 Russian3                   3
# 3: American2 Russian2                   2
# 4: American2 Russian3                   2
# 5: American1 Russian1                   2
# 6: American3 Russian1                   2
# 7: American3 Russian3                   2
# 8: American2 Russian1                   1
# 9: American3 Russian2                   1
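The join itself can also be done in data.table. Below is a minimal sketch under that assumption, not the answerer's code; it uses .N, data.table's built-in per-group row count, in place of length(). Because the sport column is duplicated on both sides, the join is many-to-many, so allow.cartesian=TRUE is passed to merge to permit a result larger than the combined inputs.

```r
library(data.table)

# Input data from the question, built directly as data.tables
dt1 <- data.table(
  America = c("American1", "American1", "American1", "American2",
              "American2", "American3", "American3"),
  AmericanSports = c("Soccer", "Chess", "Tennis", "Chess",
                     "Tennis", "Soccer", "Tennis")
)
dt2 <- data.table(
  Russia = c("Russian1", "Russian1", "Russian2", "Russian2",
             "Russian3", "Russian3", "Russian3"),
  RussianSports = c("Soccer", "Tennis", "Chess", "Tennis",
                    "Chess", "Tennis", "Soccer")
)

# Many-to-many inner join on the sport columns
mdt <- merge(dt1, dt2, by.x = "AmericanSports", by.y = "RussianSports",
             allow.cartesian = TRUE)

# .N is the number of rows in each America/Russia group
aggdt <- mdt[, .(Common_Sports_Count = .N), by = .(America, Russia)]

# Sort in place for a stable, readable order
setorder(aggdt, America, Russia)
aggdt
```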

Answer 1: (score: 1)

Using dplyr (PS: to drop columns, use df[,'columns']=NULL):

library(dplyr)
# Collapse each person's sports into a list column (this overwrites df1 and df2)
df1=df1%>%group_by(America)%>%dplyr::summarise(Sports=list(AmericanSports))
df2=df2%>%group_by(Russia)%>%dplyr::summarise(Sports=list(RussianSports))
# Build every America/Russia pair, then attach both sport lists
df=expand.grid(df1$America,df2$Russia)
df=merge(df,df1,by.x='Var1',by.y='America')
df=merge(df,df2,by.x='Var2',by.y='Russia')

# Count the sports common to both lists in each row
df['Common_Sports_Count']=apply(df[,c('Sports.x','Sports.y')],1,function(x) length(intersect(x[1][[1]],x[2][[1]])))

df
      Var2      Var1              Sports.x              Sports.y Common_Sports_Count
1 Russian1 American1 Soccer, Chess, Tennis        Soccer, Tennis                   2
2 Russian1 American2         Chess, Tennis        Soccer, Tennis                   1
3 Russian1 American3        Soccer, Tennis        Soccer, Tennis                   2
4 Russian2 American1 Soccer, Chess, Tennis         Chess, Tennis                   2
5 Russian2 American3        Soccer, Tennis         Chess, Tennis                   1
6 Russian2 American2         Chess, Tennis         Chess, Tennis                   2
7 Russian3 American1 Soccer, Chess, Tennis Chess, Tennis, Soccer                   3
8 Russian3 American3        Soccer, Tennis Chess, Tennis, Soccer                   2
9 Russian3 American2         Chess, Tennis Chess, Tennis, Soccer                   2

Edit, to get the final result:

df=df[,c('Var2','Var1','Common_Sports_Count')]

names(df)=c('Russia','America','Common_Sports_Count')

df[order(df$America),]

    Russia   America Common_Sports_Count
1 Russian1 American1                   2
4 Russian2 American1                   2
7 Russian3 American1                   3
2 Russian1 American2                   1
6 Russian2 American2                   2
9 Russian3 American2                   2
3 Russian1 American3                   2
5 Russian2 American3                   1
8 Russian3 American3                   2
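For comparison, a shorter dplyr route is to join on the sport columns and count the rows per pair, which is the same idea as the merge/aggregate answer. This is a sketch rather than the answerer's method. It assumes each person/sport row is unique and that the installed dplyr supports the name argument of count(). Recent dplyr versions may warn about a many-to-many relationship here, which is expected for this data. Like any inner join, it omits pairs with no sport in common.

```r
library(dplyr)

# Input data from the question
df1 <- data.frame(
  America = c("American1", "American1", "American1", "American2",
              "American2", "American3", "American3"),
  AmericanSports = c("Soccer", "Chess", "Tennis", "Chess",
                     "Tennis", "Soccer", "Tennis"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Russia = c("Russian1", "Russian1", "Russian2", "Russian2",
             "Russian3", "Russian3", "Russian3"),
  RussianSports = c("Soccer", "Tennis", "Chess", "Tennis",
                    "Chess", "Tennis", "Soccer"),
  stringsAsFactors = FALSE
)

# Join on sport: one row per sport shared by each America/Russia pair,
# then count those rows per pair
res <- inner_join(df1, df2, by = c("AmericanSports" = "RussianSports")) %>%
  count(America, Russia, name = "Common_Sports_Count")

res
```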