Question

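I am trying to work out how each student performs head to head against the other students in the classes they have taken and the projects they have completed.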

scores.csv：https://pastebin.com/FxUCb4xT

import pandas as pd
df = pd.read_csv("Documents/scores.csv")

student_ids = df.student_id.unique()

for id in student_ids:
    to_analyse = pd.merge(df,df[df['student_id'] == id][['class_id','project_id']])

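I will eventually do my own processing on to_analyse, but getting there with pd.merge is very slow, especially when there are thousands of unique student IDs.

Is there a more efficient way? I have tried using pivot tables, but perhaps I am barking up the wrong tree with that approach.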

Answer 1

I think using groupby is a bit faster:

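def f(x):
    # x holds one student's rows; return every row of df that shares
    # a (class_id, project_id) pair with that student
    return pd.merge(df, x[['class_id','project_id']])

# keep df intact; store the per-student results under a new name
result = df.groupby('student_id').apply(f)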
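
To see what the per-student merge produces, here is a minimal sketch on made-up data. The toy DataFrame, the score column and the name per_student are assumptions for illustration, not part of the original scores.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for scores.csv; the score column is assumed.
df = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3],
    "class_id":   [10, 10, 10, 20, 10],
    "project_id": [1, 2, 1, 1, 2],
    "score":      [90, 80, 70, 60, 50],
})

# For each student, keep every row of df that shares a (class_id, project_id)
# pair with that student. drop_duplicates avoids repeated rows if a student
# appears more than once for the same pair.
per_student = {
    sid: pd.merge(df, group[["class_id", "project_id"]].drop_duplicates())
    for sid, group in df.groupby("student_id")
}

print(per_student[1])
```

Each value in per_student plays the role of to_analyse for one student, so it can be processed later without repeating the merge.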

A solution without merge, using a concatenated key column, isin and boolean indexing:

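# build a single key column from class_id and project_id
df['both'] = df['class_id'].astype(str)  + '_' + df['project_id'].astype(str)

def f(x):
    # rows of df whose key matches any key belonging to this student
    return df[df['both'].isin(x['both'])]

# keep df intact; store the per-student results under a new name
result = df.groupby('student_id').apply(f).drop('both', axis=1)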
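
A different technique, not taken from the answer above, is a single self-merge on class_id and project_id, which pairs each student with everyone who shares a class and project in one pass. This is only a sketch on made-up data: the toy DataFrame, the score column and the _other suffix are assumptions for illustration. The result can grow large when many students share the same class and project.

```python
import pandas as pd

# Hypothetical toy data standing in for scores.csv; the score column is assumed.
df = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3],
    "class_id":   [10, 10, 10, 20, 10],
    "project_id": [1, 2, 1, 1, 2],
    "score":      [90, 80, 70, 60, 50],
})

# Pair every row with every row that has the same class and project.
pairs = df.merge(df, on=["class_id", "project_id"], suffixes=("", "_other"))

# Drop the rows where a student is paired with themselves.
pairs = pairs[pairs["student_id"] != pairs["student_id_other"]]

print(pairs)
```

Filtering pairs on student_id then gives the head-to-head rows for one student, with that student's score and the opponent's score side by side.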
