Iterate over groups and count matches between data frames

Time: 2017-01-26 08:01:38

Tags: r

Apologies if the title of the question is not very clear.

I have two data frames, as shown below:

df1
NAME   FOLLOWS
san    big supa
san    EAU
san    simulate
san    spang
glyn   guido
glyn   claire
glyn   vincent
glyn   dan
glyn   peter
glyn   EAU


df2
FOLLOWS
guido
vincent
EAU
EUSC
brian
simulate
peter

For each NAME in df1, I want to count the matches between df1$FOLLOWS and df2$FOLLOWS, and also get the length of df1$FOLLOWS for that NAME. For these data frames, I expect output like this:

df3
NAME   LENGTH_FOLLOWS   COUNT_Match
san    4                2
glyn   6                4

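2 answers:

Answer 0: (score: 1)

You can first merge df1 with df2 using a left join, so that every row of df1 is kept and rows with no match in df2 get NA. Then you can simply count the instances.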

library(sqldf)
sqldf('select NAME, count(NAME) as LENGTH_FOLLOWS , count(Actual_F) as COUNT_Match from (select t1.*, t2.FOLLOWS as Actual_F from df1 t1 left join df2 t2 on t1.FOLLOWS=t2.FOLLOWS) group by NAME')

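Or using base R

# Position of each df1$FOLLOWS value in df2$FOLLOWS (NA when there is no match)
df1$index=match(df1$FOLLOWS, df2$FOLLOWS)
# Per NAME: count the non-NA entries in each column
aggregate(cbind(LENGTH_FOLLOWS = df1$FOLLOWS, COUNT_Match = df1$index), by = list(NAME = df1$NAME) , FUN = function(x) length(x[!is.na(x)]))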
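
The same idea can be sketched end to end in base R with %in% instead of match. This is only an illustration, not part of the original answer: the data frames are rebuilt by hand from the question, and the helper names matched, len and cnt are made up for this example.

```r
# Rebuild the example data from the question (typed in by hand)
df1 <- data.frame(
  NAME = c(rep("san", 4), rep("glyn", 6)),
  FOLLOWS = c("big supa", "EAU", "simulate", "spang",
              "guido", "claire", "vincent", "dan", "peter", "EAU"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  FOLLOWS = c("guido", "vincent", "EAU", "EUSC", "brian", "simulate", "peter"),
  stringsAsFactors = FALSE
)

# TRUE where the followed account also appears in df2 (hypothetical helper column)
df1$matched <- df1$FOLLOWS %in% df2$FOLLOWS

# Per NAME: number of rows, and number of rows with a match
len <- tapply(df1$matched, df1$NAME, length)
cnt <- tapply(df1$matched, df1$NAME, sum)

df3 <- data.frame(
  NAME = names(len),
  LENGTH_FOLLOWS = as.vector(len),
  COUNT_Match = as.vector(cnt)
)
print(df3)
```

Because %in% gives one TRUE/FALSE per row of df1, a value that appears more than once in df2 is still counted once per df1 row, whereas a join would repeat that row. Note that tapply orders the groups alphabetically, so glyn comes before san.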

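Answer 1: (score: 1)

Here is an option using data.table. Convert the first data.frame to a 'data.table' (setDT(df1)) and join it with 'df2' using `on` to create an indicator column ('ind'), which is 1 for matched rows and NA otherwise. Then, grouped by 'NAME', we get the number of rows (.N) and the sum of the logical vector marking the non-NA elements of 'ind'.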

library(data.table)
setDT(df1)[df2, ind := 1, on = .(FOLLOWS)]
df1[, .(LENGTH_FOLLOWS = .N, COUNT_MATCH = sum(!is.na(ind))), NAME]
#   NAME LENGTH_FOLLOWS COUNT_MATCH
#1:  san              4           2
#2: glyn              6           4