pandas - binary matrix and string data (how to stack string data / break lines?)

Asked: 2018-08-20 20:05:01

Tags: python pandas

I am trying to build a DataFrame with a specific structure, but I can't seem to "stack" the data. A sample of my raw data:

import pandas as pd

# raw data
df = pd.DataFrame({'Name':['name1', 'name2', 'name3', 'name1', 'name2', 'name3', 'name1', 'name2', 'name3' ], 
                   'Year':['freshman','sophomore','freshman', 'freshman','sophomore','freshman', 'freshman','sophomore','freshman'], 
                   'Rotation':['ERJD','PEDI','MAM','PEDI', 'ERJD','PEDI','MAM','ERJD','ABD'],
                   'Week1':[1,1,1,0,0,0,0,0,0],
                   'Week2':[0,0,0,1,1,1,0,0,0],
                   'Week3':[0,0,0,0,0,0,1,1,1],
                   'Week4':[1,0,0,0,0,0,0,1,1]
                  })
df = df[['Name','Year','Rotation','Week1','Week2','Week3','Week4']]

It looks like this:

    Name    Year    Rotation    Week1   Week2   Week3   Week4
0   name1   freshman    ERJD      1       0       0       1
1   name2   sophomore   PEDI      1       0       0       0
2   name3   freshman    MAM       1       0       0       0
3   name1   freshman    PEDI      0       1       0       0
4   name2   sophomore   ERJD      0       1       0       0
5   name3   freshman    PEDI      0       1       0       0
6   name1   freshman    MAM       0       0       1       0
7   name2   sophomore   ERJD      0       0       1       1
8   name3   freshman    ABD       0       0       1       1

I reshaped the DataFrame:

#Reshape Table + Filtering
df = pd.melt(df, 
             id_vars=['Name','Year','Rotation'], 
             value_vars=list(df.columns[3:]),
             var_name='Week', 
             value_name='Sum of Value')

# Keep only the rows flagged with 1 and renumber the index from 0
df = df.loc[df['Sum of Value'] == 1].reset_index(drop=True)

which produces:

    Name    Year    Rotation    Week    Sum of Value
0   name1   freshman    ERJD    Week1       1
1   name2   sophomore   PEDI    Week1       1
2   name3   freshman    MAM     Week1       1
3   name1   freshman    PEDI    Week2       1
4   name2   sophomore   ERJD    Week2       1
5   name3   freshman    PEDI    Week2       1
6   name1   freshman    MAM     Week3       1
7   name2   sophomore   ERJD    Week3       1
8   name3   freshman    ABD     Week3       1
9   name1   freshman    ERJD    Week4       1
10  name2   sophomore   ERJD    Week4       1
11  name3   freshman    ABD     Week4       1

I then create a pivot table:

#Create Pivot
pivot = df.pivot_table(index=['Rotation','Year'], columns='Week', values='Name', aggfunc=lambda x: ' '.join(x))
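weeks = ['Week1', 'Week2', 'Week3', 'Week4'] # Not defined in the original snippet; assumed to be the week column names
pivot = pivot.reindex(weeks, axis=1) # Change order of Columns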
pivot

which produces:

                    Week1       Week2      Week3    Week4
Rotation    Year                
ABD       freshman   None        None      name3    name3
ERJD      freshman  name1        None       None    name1
          sophomore  None       name2      name2    name2
MAM       freshman  name3        None      name1     None
PEDI      freshman   None  name1 name3      None     None
          sophomore name2        None       None     None

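I want the names in the table stacked on top of each other. For example, for PEDI in Week2, name1 and name3 currently sit side by side in one cell. How can I put the names on separate rows? Is there a better approach than a pivot table? Is the pd.melt step even necessary?

Desired structure: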

                    Week1       Week2      Week3    Week4
Rotation    Year                
ABD       freshman   None        None      name3    name3
ERJD      freshman  name1        None       None    name1
          sophomore  None       name2      name2    name2
MAM       freshman  name3        None      name1     None
PEDI      freshman   None        name1      None     None    
                                 name3
          sophomore name2        None       None     None

Thanks in advance for your help!

Solution:

After pd.melt, do the following:

df['aggval'] = df['Week'].map(str) + df['Rotation']
df['aggval'] = df.groupby(['aggval']).cumcount()+1
pivot = df.pivot_table(index=['Rotation','aggval'], columns='Week', values='Name', aggfunc=lambda x: ' '.join(x)).fillna('')
pivot = pivot.reindex(weeks, axis=1)

3 answers:

Answer 0 (score: 0)

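Starting from the raw DataFrame (before pd.melt), you can loop over the weeks of interest and conditionally fill in the names, like this:

import numpy as np

for week in ['Week1','Week2','Week3','Week4']:
    # Where the flag is 1, substitute the person's name; otherwise keep the 0
    df[week] = np.where(df[week]==1, df['Name'], df[week])

This gives: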

    Name      Year Rotation  Week1  Week2  Week3  Week4
0  name1  freshman     ERJD  name1      0      0  name1
1  name2  sophmore     PEDI  name2      0      0      0
2  name3  freshman      MAM  name3      0      0      0
3  name1  freshman     PEDI      0  name1      0      0
4  name2  sophmore     ERJD      0  name2      0      0
5  name3  freshman     PEDI      0  name3      0      0
6  name1  freshman      MAM      0      0  name1      0
7  name2  sophmore     ERJD      0      0  name2  name2
8  name3  freshman      ABD      0      0  name3  name3

You can then group the DataFrame and collect the string entries of each group into lists:

grouped = df.drop('Name', axis=1).groupby(['Rotation','Year']).agg(lambda x: [i for i in x if isinstance(i, str)])

which gives:

                     Week1           Week2    Week3    Week4
Rotation Year                                               
ABD      freshman       []              []  [name3]  [name3]
ERJD     freshman  [name1]              []       []  [name1]
         sophmore       []         [name2]  [name2]  [name2]
MAM      freshman  [name3]              []  [name1]       []
PEDI     freshman       []  [name1, name3]       []       []
         sophmore  [name2]              []       []       []

Note that there is an error in the OP's desired output: there is no ('MAM', 'sophmore') group. Also note, for clarity, that 'sophmore' is spelled 'sophomore'.
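If the goal is literally a line break inside a cell rather than a list, a variation on this idea is to join each group's names with a newline. The following is only a sketch under assumptions: the variable names `names` and `cells` are made up for illustration, the data is the sample from the question, and it skips the intermediate 0 placeholders by using `Series.where` instead of `np.where`. It also works on the raw data, so no pd.melt is needed.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['name1', 'name2', 'name3'] * 3,
    'Year': ['freshman', 'sophomore', 'freshman'] * 3,
    'Rotation': ['ERJD', 'PEDI', 'MAM', 'PEDI', 'ERJD', 'PEDI', 'MAM', 'ERJD', 'ABD'],
    'Week1': [1, 1, 1, 0, 0, 0, 0, 0, 0],
    'Week2': [0, 0, 0, 1, 1, 1, 0, 0, 0],
    'Week3': [0, 0, 0, 0, 0, 0, 1, 1, 1],
    'Week4': [1, 0, 0, 0, 0, 0, 0, 1, 1],
})
weeks = ['Week1', 'Week2', 'Week3', 'Week4']

# For each week column, keep the name where the flag is 1 and NaN elsewhere
names = df[weeks].eq(1).apply(lambda flags: df['Name'].where(flags))

# Group by the original Rotation/Year columns and join the names in each
# cell with a newline; cells with no names become empty strings
cells = names.groupby([df['Rotation'], df['Year']]).agg(
    lambda s: '\n'.join(s.dropna())
)

# Printing a single cell shows the two names on separate lines
print(cells.loc[('PEDI', 'freshman'), 'Week2'])
```

Printing the whole frame will most likely show the newline as an escaped `\n` rather than a real break, so this form is mainly useful when exporting (for example to a spreadsheet) or when rendering cells individually.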

Answer 1 (score: 0)

You can do this with set_index and mul, applied to the raw DataFrame (before pd.melt):

import numpy as np

df1 = df.set_index(['Rotation','Year'])

df1.filter(like='Week').mul(df1['Name'], axis=0)\
  .replace('',np.nan)\
  .sort_index()

Output:

                     Week1  Week2  Week3  Week4
Rotation Year                                 
ABD      freshman     NaN    NaN  name3  name3
ERJD     freshman   name1    NaN    NaN  name1
         sophomore    NaN  name2    NaN    NaN
         sophomore    NaN    NaN  name2  name2
MAM      freshman   name3    NaN    NaN    NaN
         freshman     NaN    NaN  name1    NaN
PEDI     freshman     NaN  name1    NaN    NaN
         freshman     NaN  name3    NaN    NaN
         sophomore  name2    NaN    NaN    NaN
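The multiplication above presumably works because multiplying a Python string by an integer repeats it: a flag of 1 keeps the name and a flag of 0 yields an empty string, which is why the empty strings are then replaced with NaN. A tiny plain-Python illustration of that behaviour (the variable names are only for this example):

```python
name = 'name1'

kept = 1 * name     # flag 1 repeats the string once
dropped = 0 * name  # flag 0 repeats it zero times, giving ''

print(repr(kept), repr(dropped))
```

Note that this approach leaves one row per original record, so groups such as ('PEDI', 'freshman') appear as repeated index entries rather than as a single merged row.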

Answer 2 (score: 0)

After pd.melt, do the following:

df['aggval'] = df['Week'].map(str) + df['Rotation']
df['aggval'] = df.groupby(['aggval']).cumcount()+1
pivot = df.pivot_table(index=['Rotation','aggval'], columns='Week', values='Name', aggfunc=lambda x: ' '.join(x)).fillna('')
pivot = pivot.reindex(weeks, axis=1)
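For reference, here is a self-contained sketch of the same counter idea that runs from the raw data to the stacked table. It is an illustration under assumptions, not the answer's exact code: the names `tidy`, `slot` and `stacked` are invented here, Year is kept in the index so the result matches the desired layout, and `unstack` is swapped in for `pivot_table` because the counter makes every (Rotation, Year, slot, Week) combination unique.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['name1', 'name2', 'name3'] * 3,
    'Year': ['freshman', 'sophomore', 'freshman'] * 3,
    'Rotation': ['ERJD', 'PEDI', 'MAM', 'PEDI', 'ERJD', 'PEDI', 'MAM', 'ERJD', 'ABD'],
    'Week1': [1, 1, 1, 0, 0, 0, 0, 0, 0],
    'Week2': [0, 0, 0, 1, 1, 1, 0, 0, 0],
    'Week3': [0, 0, 0, 0, 0, 0, 1, 1, 1],
    'Week4': [1, 0, 0, 0, 0, 0, 0, 1, 1],
})
weeks = ['Week1', 'Week2', 'Week3', 'Week4']

# Wide to long, keeping only the (person, week) pairs flagged with 1
tidy = df.melt(id_vars=['Name', 'Year', 'Rotation'], value_vars=weeks,
               var_name='Week', value_name='flag')
tidy = tidy[tidy['flag'] == 1].drop(columns='flag').reset_index(drop=True)

# Number the names that share one (Rotation, Year, Week) cell: 1, 2, ...
tidy['slot'] = tidy.groupby(['Rotation', 'Year', 'Week']).cumcount() + 1

# Each slot becomes its own row, so names sharing a cell stack vertically
stacked = (tidy.set_index(['Rotation', 'Year', 'slot', 'Week'])['Name']
               .unstack('Week')
               .reindex(columns=weeks)
               .fillna(''))

print(stacked)
```

With the sample data, ('PEDI', 'freshman') gets two rows: slot 1 holds name1 and slot 2 holds name3 under Week2. The counter is what makes a plain reshape possible, since without it two names would compete for the same cell.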