Merging pandas DataFrames with frequency counts

Time: 2019-03-20 02:21:49

Tags: python-3.x pandas dataframe

I have a dataframe (df1) that contains student details, for example:

Student ID     Course Code       Mark
   1              C001            88  
   1              C002            71
   2              C003            67
   3              C002            92
   3              C001            66
   3              C004            70
   4              C004            65

and another dataframe (df2) with:

WR ID        K ID        Course Code
SP-RS-01     K001        C002, C004
SP-RS-01     K004        C002
SP-RS-02     K005
SP-RS-03     K004        C003, C004
SP-RS-03     K006        C001
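
For anyone who wants to reproduce the problem, the two tables above can be built roughly as follows. This is only a sketch: it assumes the empty Course Code cell for K005 is a missing value (NaN), which the question does not state explicitly.

```python
import numpy as np
import pandas as pd

# Sample data copied from the df1 table in the question.
df1 = pd.DataFrame({
    "Student ID": [1, 1, 2, 3, 3, 3, 4],
    "Course Code": ["C001", "C002", "C003", "C002", "C001", "C004", "C004"],
    "Mark": [88, 71, 67, 92, 66, 70, 65],
})

# Sample data copied from the df2 table in the question.
# Assumption: the blank Course Code for K005 is stored as NaN.
df2 = pd.DataFrame({
    "WR ID": ["SP-RS-01", "SP-RS-01", "SP-RS-02", "SP-RS-03", "SP-RS-03"],
    "K ID": ["K001", "K004", "K005", "K004", "K006"],
    "Course Code": ["C002, C004", "C002", np.nan, "C003, C004", "C001"],
})

print(df1)
print(df2)
```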

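Now I need a dataframe that contains the K ID and WR ID for each Student ID, depending on the courses they took. If an ID is mentioned more than once, I would possibly like to mention the count (as a dictionary). So, maybe something like this: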

Student ID   Courses            K ID                   WR ID
    1        C001, C002         K006, K001, K004       SP-RS-03
    2        C003               K004                   SP-RS-01, SP-RS-03
    3        C001, C002, C004   K001x2, K006, K004x2   SP-RS-01, SP-RS-03
    4        C004               K004                   SP-RS-01, SP-RS-03

How can I do this?

1 answer:

Answer 0: (score: 2)

You can use:

#first flatten the values that are split by ','
#(raw string so that \s is not treated as an invalid string escape)
s = (df2.set_index(['WR ID','K ID'])['Course Code']
        .str.split(r',\s+', expand=True)
        .stack()
        .reset_index(level=2, drop=True)
        .rename('Course Code')
        )
#print (s)

#aggregate list per Course Code
df2 = (df2.drop('Course Code', axis=1)
          .join(s, on=['WR ID','K ID'])
          .groupby('Course Code')
          .agg(list)
          .reset_index()
          )

print (df2)
  Course Code                 WR ID          K ID
0        C001            [SP-RS-03]        [K006]
1        C002  [SP-RS-01, SP-RS-01]  [K001, K004]
2        C003            [SP-RS-03]        [K004]
3        C004  [SP-RS-01, SP-RS-03]  [K001, K004]
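
As an aside, the same per-course lookup table can probably be built with `DataFrame.explode` (available from pandas 0.25). This is an alternative sketch, not the code from the answer; the names `flat` and `lookup` are made up for illustration, and the blank Course Code is assumed to be NaN.

```python
import numpy as np
import pandas as pd

# df2 from the question (assumption: the blank Course Code is NaN).
df2 = pd.DataFrame({
    "WR ID": ["SP-RS-01", "SP-RS-01", "SP-RS-02", "SP-RS-03", "SP-RS-03"],
    "K ID": ["K001", "K004", "K005", "K004", "K006"],
    "Course Code": ["C002, C004", "C002", np.nan, "C003, C004", "C001"],
})

# Drop rows without any course, then turn "C002, C004" into ["C002", " C004"].
flat = df2.dropna(subset=["Course Code"]).copy()
flat["Course Code"] = flat["Course Code"].str.split(",")

# One row per (WR ID, K ID, single course); strip the leftover spaces.
flat = flat.explode("Course Code").reset_index(drop=True)
flat["Course Code"] = flat["Course Code"].str.strip()

# Collect the WR ID and K ID values into lists per course.
lookup = (flat.groupby("Course Code")[["WR ID", "K ID"]]
              .agg(list)
              .reset_index())
print(lookup)
```

Whether this reads better than the `stack`-based version is a matter of taste; the resulting table has the same shape as the one printed above.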

from collections import Counter

#flatten the nested lists, count each value with Counter
#and append 'xN' to every value that occurs more than once
f = lambda x: ', '.join(f'{k}x{v}' if v > 1 else k 
                        for k, v in Counter([z for y in x for z in y]).items())

#merge together and aggregate again
df = (df1.merge(df2, on='Course Code', how='left')
         .groupby('Student ID')
         .agg({'Course Code':', '.join,
               'WR ID':f,
               'K ID':f})
         .reset_index()
      )
print (df)
   Student ID       Course Code                   WR ID                  K ID
0           1        C001, C002    SP-RS-03, SP-RS-01x2      K006, K001, K004
1           2              C003                SP-RS-03                  K004
2           3  C002, C001, C004  SP-RS-01x3, SP-RS-03x2  K001x2, K004x2, K006
3           4              C004      SP-RS-01, SP-RS-03            K001, K004
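
To see what the aggregation function does in isolation, here is a small sketch that applies the same logic to plain Python lists. The name `format_counts` is hypothetical; the input is the K ID lists that student 3 receives after the merge.

```python
from collections import Counter


def format_counts(nested):
    """Flatten a sequence of lists and join the items, adding 'xN' for repeats."""
    counts = Counter(item for sub in nested for item in sub)
    return ", ".join(f"{key}x{n}" if n > 1 else key for key, n in counts.items())


# Student 3 took C002, C001 and C004; these are the matching K ID lists.
k_ids = [["K001", "K004"], ["K006"], ["K001", "K004"]]
result = format_counts(k_ids)
print(result)  # K001x2, K004x2, K006
```

`Counter` keeps keys in first-seen order on Python 3.7+, which is why the output order follows the order of the courses in the merged rows.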

Edit:

The problem is caused by some missing values; the solution is to replace them with empty lists:

from collections import Counter

#flatten the nested lists, count each value with Counter
#and append 'xN' to every value that occurs more than once
f = lambda x: ', '.join(f'{k}x{v}' if v > 1 else k 
                        for k, v in Counter([z for y in x for z in y]).items())

#merge together
df = df1.merge(df2, on='Course Code', how='left')
#courses without a match get NaN; NaN != NaN, so x==x is False only for them
#(in pandas >= 2.1 DataFrame.applymap is deprecated in favour of DataFrame.map)
df[['WR ID','K ID']] = df[['WR ID','K ID']].applymap(lambda x: x if x==x else [])

#aggregate again
df = (df.groupby('Student ID')
        .agg({'Course Code':', '.join,
              'WR ID':f,
              'K ID':f})
        .reset_index()
      )