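I have a data frame, df1, that reports the courses each student has taken. ID is the student's ID, COURSES is the list of courses the student has taken, and TYPE and MAJOR are student attributes. The data frame looks like this: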
ID  COURSES                                            TYPE      MAJOR
1   ['Intr To Archaeology', 'Statics', 'Circuits I…]   Freshman  EEEL
2   ['Signals & Systems I', 'Instrumentation'…]        Transfer  EEEL
3   ['Keyboard Competence', 'Elementary … ]            Freshman  EEEL
4   ['Cultural Anthro', 'Vector Analysis' … ]          Freshman  EEEL
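I created a new data frame, df2, that reports a dissimilarity measure for each pair of students based on the courses they have taken. Each row of df2 holds ID_1, ID_2, Distance, ID_1_Transfer, and ID_2_Transfer.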
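I build it with the script below, but it runs very slowly (there are thousands of students). Can anyone suggest a more efficient way to create df2?
One major problem is that the script computes the distance for both (student 1, student 2) and (student 2, student 1), which is redundant because the distance is the same in both directions. However, the condition I wrote to prevent this: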
if (id1 >= id2):
    continue
does not work.
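The `id1 >= id2` test compares the index labels that `iterrows()` yields, so it only removes mirrored pairs when those labels are unique and ordered; whether that holds for this frame is not shown. A minimal sketch of an alternative that does not depend on the index, assuming the IDs are available as a plain list (`student_ids` is a hypothetical name): `itertools.combinations` yields each unordered pair exactly once.

```python
from itertools import combinations

# Hypothetical stand-in for the ID column of df1.
student_ids = [1, 2, 3, 4]

# Each unordered pair appears once; no self-pairs, no mirrored pairs.
pairs = list(combinations(student_ids, 2))
print(pairs)
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

For n students this gives n * (n - 1) / 2 pairs instead of the n * n iterations of two nested loops.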
The whole script:
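import ast
import pandas as pd

# df is the student frame described above (df1).
# Assumption: df2 starts out empty. DataFrame.append was removed in pandas 2.0,
# so the script runs as written only on pandas < 2.0.
df2 = pd.DataFrame()

for id1, student1 in df.iterrows():
    for id2, student2 in df.iterrows():
        # Intended to skip self-pairs and mirrored pairs (compares index labels)
        if (id1 >= id2):
            continue
        ID_1 = student1["ID"]
        ID_2 = student2["ID"]
        # courses as list strings
        s1 = student1["COURSES"]
        s2 = student2["COURSES"]
        try:
            # courses as sets
            courses1 = set(ast.literal_eval(s1))
            courses2 = set(ast.literal_eval(s2))
            distance = float(len(courses1.symmetric_difference(courses2))) / (len(courses1) + len(courses2))
        except Exception:
            # Some strings seem to have a different format
            distance = -1
        ID_1_Transfer = 1 if student1["TYPE"] == "Transfer" else 0
        ID_2_Transfer = 1 if student2["TYPE"] == "Transfer" else 0
        # Appending one row at a time copies df2 on every iteration
        df2 = df2.append({'ID_1': ID_1, 'ID_2': ID_2, 'Distance': distance, 'ID_1_Transfer': ID_1_Transfer, 'ID_2_Transfer': ID_2_Transfer}, ignore_index=True)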
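One possible way to speed this up, offered as a sketch rather than a definitive answer: parse each COURSES string once instead of once per pair, enumerate pairs with `itertools.combinations`, and collect the rows in a list so the data frame is built a single time. The helper names `parse_courses` and `pairwise_distances` and the three-row sample frame are hypothetical; the distance formula and the -1 sentinel follow the script above.

```python
import ast
from itertools import combinations

import pandas as pd


def parse_courses(raw):
    """Return the course set for one row, or None if the string cannot be parsed."""
    try:
        return set(ast.literal_eval(raw))
    except (ValueError, SyntaxError, TypeError):
        return None


def pairwise_distances(students):
    # Parse every COURSES string once, up front.
    ids = students["ID"].tolist()
    course_sets = [parse_courses(raw) for raw in students["COURSES"]]
    transfer = (students["TYPE"] == "Transfer").astype(int).tolist()

    rows = []
    # Positions, not index labels, so each unordered pair is visited once.
    for i, j in combinations(range(len(ids)), 2):
        a, b = course_sets[i], course_sets[j]
        if a is None or b is None or not (a or b):
            distance = -1.0  # same sentinel as the original script
        else:
            distance = len(a ^ b) / (len(a) + len(b))
        rows.append((ids[i], ids[j], distance, transfer[i], transfer[j]))

    # Build the result in one step instead of appending row by row.
    return pd.DataFrame(
        rows,
        columns=["ID_1", "ID_2", "Distance", "ID_1_Transfer", "ID_2_Transfer"],
    )


# Hypothetical sample data, not taken from the real df1.
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "COURSES": [
        "['Statics', 'Circuits I']",
        "['Statics', 'Instrumentation']",
        "not a list",
    ],
    "TYPE": ["Freshman", "Transfer", "Freshman"],
    "MAJOR": ["EEEL", "EEEL", "EEEL"],
})

df2 = pairwise_distances(df1)
print(df2)
```

With the sample frame, students 1 and 2 share one of two courses each, so their distance is 2 / 4 = 0.5, and both pairs involving student 3 get -1 because that string cannot be parsed.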