我有两个数据帧,其中有一个列在两个数据帧中都是通用的。我试图通过其他列获取共享元素的数量。以下是我的输入和输出。
super
我尝试了内部加入输入数据框(使用DF 1
America AmericanSports
1 American1 Soccer
2 American1 Chess
3 American1 Tennis
4 American2 Chess
5 American2 Tennis
6 American3 Soccer
7 American3 Tennis
DF2
Russia RussianSports
1 Russian1 Soccer
2 Russian1 Tennis
3 Russian2 Chess
4 Russian2 Tennis
5 Russian3 Chess
6 Russian3 Tennis
7 Russian3 Soccer
Output
ANames RNames Common_Sports_Count
1 American1 Russian1 2
2 American1 Russian2 2
3 American1 Russian3 2
4 American2 Russian1 1
5 American2 Russian2 2
6 American2 Russian3 3
7 American3 Russian1 2
8 American3 Russian2 1
9 American3 Russian3 2
),但却无法正确获取逻辑。我知道这很简单,但却无法思考正确的逻辑。
答案 0 :(得分:2)
只需merge
然后aggregate
:
mdf <- merge(df1, df2, by.x="AmericanSports", by.y="RussianSports")
aggdf <- aggregate(.~America+Russia, mdf, FUN=length)
names(aggdf)[ncol(aggdf)] <- "Common_Sports_Count"
aggdf
# America Russia Common_Sports_Count
# 1 American1 Russian1 2
# 2 American2 Russian1 1
# 3 American3 Russian1 2
# 4 American1 Russian2 2
# 5 American2 Russian2 2
# 6 American3 Russian2 1
# 7 American1 Russian3 3
# 8 American2 Russian3 2
# 9 American3 Russian3 2
要在大型数据集上加快聚合,请考虑data.table
包:
library(data.table)
dt <- data.table(mdf)
aggdt <- dt[, list(Common_Sports_Count=length(AmericanSports)), by='America,Russia']
aggdt
# America Russia Common_Sports_Count
# 1: American1 Russian2 2
# 2: American1 Russian3 3
# 3: American2 Russian2 2
# 4: American2 Russian3 2
# 5: American1 Russian1 2
# 6: American3 Russian1 2
# 7: American3 Russian3 2
# 8: American2 Russian1 1
# 9: American3 Russian2 1
答案 1 :(得分:1)
使用dplyr
(PS:Drop列使用df[,'columns']=NULL
)
library(dplyr)
df1=df1%>%group_by(America)%>%dplyr::summarise(Sports=list(AmericanSports))
df2=df2%>%group_by(Russia)%>%dplyr::summarise(Sports=list(RussianSports))
df=expand.grid(df1$America,df2$Russia)
df=merge(df,df1,by.x='Var1',by.y='America')
df=merge(df,df2,by.x='Var2',by.y='Russia')
df['Common_Sports_Count']=apply(df[,c('Sports.x','Sports.y')],1,function(x) length(intersect(x[1][[1]],x[2][[1]])))
df
Var2 Var1 Sports.x Sports.y Common_Sports_Count
1 Russian1 American1 Soccer, Chess, Tennis Soccer, Tennis 2
2 Russian1 American2 Chess, Tennis Soccer, Tennis 1
3 Russian1 American3 Soccer, Tennis Soccer, Tennis 2
4 Russian2 American1 Soccer, Chess, Tennis Chess, Tennis 2
5 Russian2 American3 Soccer, Tennis Chess, Tennis 1
6 Russian2 American2 Chess, Tennis Chess, Tennis 2
7 Russian3 American1 Soccer, Chess, Tennis Chess, Tennis, Soccer 3
8 Russian3 American3 Soccer, Tennis Chess, Tennis, Soccer 2
9 Russian3 American2 Chess, Tennis Chess, Tennis, Soccer 2
编辑以获得结果
df=df[,c('Var2','Var1','Common_Sports_Count')]
names(df)=c('Russia','America','Common_Sports_Count')
df[order(df$America),]
Russia America Common_Sports_Count
1 Russian1 American1 2
4 Russian2 American1 2
7 Russian3 American1 3
2 Russian1 American2 1
6 Russian2 American2 2
9 Russian3 American2 2
3 Russian1 American3 2
5 Russian2 American3 1
8 Russian3 American3 2