I have a dataframe (df1) that contains student details, for example:
Student ID  Course Code  Mark
1           C001         88
1           C002         71
2           C003         67
3           C002         92
3           C001         66
3           C004         70
4           C004         65
and another dataframe (df2) with:
WR ID     K ID  Course Code
SP-RS-01  K001  C002, C004
SP-RS-01  K004  C002
SP-RS-02  K005
SP-RS-03  K004  C003, C004
SP-RS-03  K006  C001
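Now I need a dataframe that lists, for each Student ID, the K IDs and WR IDs that correspond to the courses the student took. If an ID appears more than once, I may also want to record its count (for example as a dictionary). So, perhaps something like this: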
Student ID  Courses           KID                   WR ID
1           C001, C002        K006, K001, K004      SP-RS-03
2           C003              K004                  SP-RS-01, SP-RS-03
3           C001, C002, C004  K001x2, K006, K004x2  SP-RS-01, SP-RS-03
4           C004              K004                  SP-RS-01, SP-RS-03
How can I do this?
Answer 0: (score: 2)
You can use:
# First, flatten the comma-separated Course Code values into one row per course
s = (df2.set_index(['WR ID','K ID'])['Course Code']
        .str.split(r',\s+', expand=True)
        .stack()
        .reset_index(level=2, drop=True)
        .rename('Course Code')
     )
# print(s)
# Aggregate WR ID and K ID into lists per Course Code
df2 = (df2.drop('Course Code', axis=1)
          .join(s, on=['WR ID','K ID'])
          .groupby('Course Code')
          .agg(list)
          .reset_index()
       )
print(df2)
   Course Code                 WR ID          K ID
0         C001            [SP-RS-03]        [K006]
1         C002  [SP-RS-01, SP-RS-01]  [K001, K004]
2         C003            [SP-RS-03]        [K004]
3         C004  [SP-RS-01, SP-RS-03]  [K001, K004]
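As an alternative to the split/stack/join steps above, the same per-course lookup can be sketched with `DataFrame.explode` (available from pandas 0.25). This is only a sketch: the sample frame is copied from the question, the blank Course Code for K005 is assumed to be a missing value, and the names `flat` and `lookup` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question; the blank Course Code is assumed to be missing.
df2 = pd.DataFrame({
    "WR ID": ["SP-RS-01", "SP-RS-01", "SP-RS-02", "SP-RS-03", "SP-RS-03"],
    "K ID": ["K001", "K004", "K005", "K004", "K006"],
    "Course Code": ["C002, C004", "C002", None, "C003, C004", "C001"],
})

# Turn each comma-separated cell into a list, then give every course its own row.
flat = df2.copy()
flat["Course Code"] = flat["Course Code"].str.split(r",\s*")
flat = flat.explode("Course Code").dropna(subset=["Course Code"])

# Collect the WR IDs and K IDs that belong to each course.
lookup = flat.groupby("Course Code")[["WR ID", "K ID"]].agg(list).reset_index()
print(lookup)
```

The `dropna` call removes the row whose Course Code is missing, so it does not show up as a group of its own.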
from collections import Counter
# Flatten the nested lists, count each ID with Counter,
# and append 'xN' to every ID that occurs more than once
f = lambda x: ', '.join(f'{k}x{v}' if v > 1 else k
                        for k, v in Counter([z for y in x for z in y]).items())
# Merge with df1 and aggregate again, this time per student
df = (df1.merge(df2, on='Course Code', how='left')
         .groupby('Student ID')
         .agg({'Course Code':', '.join,
               'WR ID':f,
               'K ID':f})
         .reset_index()
      )
print(df)
   Student ID       Course Code                   WR ID                  K ID
0           1        C001, C002    SP-RS-03, SP-RS-01x2      K006, K001, K004
1           2              C003                SP-RS-03                  K004
2           3  C002, C001, C004  SP-RS-01x3, SP-RS-03x2  K001x2, K004x2, K006
3           4              C004      SP-RS-01, SP-RS-03            K001, K004
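To see what the counting step does in isolation, here is a small standalone sketch of the same idea written as a named function. The name `format_counts` is made up for illustration; the input is the list of K ID lists that student 3 ends up with after the merge.

```python
from collections import Counter
from itertools import chain

def format_counts(nested):
    """Flatten a sequence of lists and join the items, writing repeats as 'itemxN'."""
    counts = Counter(chain.from_iterable(nested))
    return ", ".join(f"{key}x{n}" if n > 1 else key for key, n in counts.items())

# K IDs for student 3, one list per course taken (C002, C001, C004).
print(format_counts([["K001", "K004"], ["K006"], ["K001", "K004"]]))
# K001x2, K004x2, K006
```

`Counter` keeps keys in first-seen order on Python 3.7+, which is why the IDs come out in the order they are first encountered.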
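EDIT:
The problem is caused by missing values (for example, a course with no match in df2 becomes NaN after the left merge); the solution is to replace them with empty lists: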
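from collections import Counter
# Flatten the nested lists, count each ID with Counter,
# and append 'xN' to every ID that occurs more than once
f = lambda x: ', '.join(f'{k}x{v}' if v > 1 else k
                        for k, v in Counter([z for y in x for z in y]).items())
# Merge first, then replace missing values with empty lists before aggregating
df = df1.merge(df2, on='Course Code', how='left')
# NaN is the only value here for which x == x is False, so NaN cells become []
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df[['WR ID','K ID']] = df[['WR ID','K ID']].applymap(lambda x: x if x==x else [])
df = (df.groupby('Student ID')
        .agg({'Course Code':', '.join,
              'WR ID':f,
              'K ID':f})
        .reset_index()
      )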
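Another way to handle the missing values, sketched here under some assumptions, is to skip non-list cells inside the aggregation function instead of filling them first. The `lookup` frame below is the per-course result printed earlier in this answer; the small `df1` with student 5 and course C999 is hypothetical, added only to produce an unmatched course, and `format_counts` is an illustrative name.

```python
from collections import Counter
import pandas as pd

# Per-course lookup, as printed earlier in this answer.
lookup = pd.DataFrame({
    "Course Code": ["C001", "C002", "C003", "C004"],
    "WR ID": [["SP-RS-03"], ["SP-RS-01", "SP-RS-01"], ["SP-RS-03"], ["SP-RS-01", "SP-RS-03"]],
    "K ID": [["K006"], ["K001", "K004"], ["K004"], ["K001", "K004"]],
})

# Hypothetical marks: student 5 takes C999, which has no row in the lookup.
df1 = pd.DataFrame({
    "Student ID": [4, 5, 5],
    "Course Code": ["C004", "C001", "C999"],
    "Mark": [65, 80, 75],
})

def format_counts(cells):
    """Join all IDs in a group, skipping cells that are not lists (NaN after the merge)."""
    counts = Counter(item for cell in cells if isinstance(cell, list) for item in cell)
    return ", ".join(f"{key}x{n}" if n > 1 else key for key, n in counts.items())

merged = df1.merge(lookup, on="Course Code", how="left")
result = (
    merged.groupby("Student ID")
          .agg({"Course Code": ", ".join, "WR ID": format_counts, "K ID": format_counts})
          .reset_index()
)
print(result)
```

With this variant the unmatched course still appears in the Course Code column, but contributes nothing to the WR ID and K ID columns.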