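pandas: rows to columns

Time: 2017-06-22 13:52:50

Tags: python pandas

I have the following data in a pandas dataframe, df. For each ADM_ID, I am trying to turn the code values into columns, with each column named after the code's position.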

ID  ADM_ID  code
108 183350  7100
108 183350  5849
108 183350  5780
108 183350  99811
108 183350  4466
108 183350  40301
108 183350  58281
108 183350  E8798
108 183350  58889
108 183350  4430
108 183350  78659
109 128755  4372
109 128755  78039
109 128755  7100
109 128755  40391
109 128755  4251
109 128755  2859
109 164029  40301
109 164029  7100
109 164029  5856
109 164029  V4983
109 164029  58381
109 164029  3643
109 108375  7100
109 108375  40301
109 108375  5856
109 108375  58381
109 108375  3643
109 108375  28521
109 193281  40301
109 193281  5856
109 193281  7100
109 193281  7907
109 193281  4254
109 193281  99662
109 193281  99812
109 193281  36001
109 193281  11289
109 193281  V5865
109 193281  7821
109 193281  28521
109 193281  37900
109 193281  37632
109 193281  37005
109 193281  36400 

I would like to convert it to the following:

ID  ADM_ID  cnt code1   code2   code3   code4   code5   code6   code7   code8   code9   code10  code11  code12  code13  code14  code15  code16
108 183350  11  7100    5849    5780    99811   4466    40301   58281   E8798   58889   4430    78659                   
109 128755  6   4372    78039   7100    40391   4251    2859                                        
109 164029  6   40301   7100    5856    V4983   58381   3643                                        
109 108375  6   7100    40301   5856    58381   3643    28521                                       
109 193281  16  40301   5856    7100    7907    4254    99662   99812   36001   11289   V5865   7821    28521   37900   37632   37005   36400

I cannot guarantee that there will be at most 16 codes; each id has an arbitrary number of codes. Can someone help me with this?

Thanks,

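4 answers:

Answer 0 (score: 0)

On what basis are the values assigned to columns? From the data you posted, it is not clear how the data should be distributed across the new columns.

It seems that the .pivot() function would work well here. However, you will need to add some keys or .groupby() arguments to reshape the dataframe.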

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html

https://pandas.pydata.org/pandas-docs/stable/reshaping.html

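Update

First, number the codes within each group -

new_df = old_df.assign(pos=old_df.groupby('ADM_ID').cumcount() + 1)

Then pivot on that position -

newer_df = new_df.pivot(index='ADM_ID', columns='pos', values='code')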
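
For a pivot-style reshape to work, each code needs a position within its group, which the posted data does not contain. The following is a minimal sketch, assuming a small illustrative subset of the question's data and the hypothetical names `pos` and `wide`. It numbers the codes with `cumcount` and then uses `set_index` plus `unstack`, which performs the same kind of reshaping as a pivot.

```python
import pandas as pd

# Illustrative subset of the question's data; codes are kept as strings.
df = pd.DataFrame({
    "ID": [108, 108, 108, 109, 109],
    "ADM_ID": [183350, 183350, 183350, 128755, 128755],
    "code": ["7100", "5849", "E8798", "4372", "78039"],
})

# Position of each code within its (ID, ADM_ID) group: 1, 2, 3, ...
pos = df.groupby(["ID", "ADM_ID"]).cumcount() + 1

wide = (
    df.assign(pos=pos)
      .set_index(["ID", "ADM_ID", "pos"])["code"]
      .unstack("pos")        # one column per position
      .add_prefix("code")    # 1, 2, 3 -> code1, code2, code3
)
wide.columns.name = None     # drop the leftover "pos" label on the columns

# Count the non-missing codes per row, then restore ID and ADM_ID as columns.
wide.insert(0, "cnt", wide.notna().sum(axis=1))
wide = wide.reset_index()
print(wide)
```

The number of code columns follows from the longest group, so nothing here depends on there being exactly 16 codes.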

Answer 1 (score: 0)

You can use:

df1 = df.groupby(['ID','ADM_ID'])['code'].apply(list)
l = df1.str.len()

df = pd.DataFrame(df1.values.tolist(), 
                  index=df1.index, 
                  columns = range(1, l.max() +1)) \
       .add_prefix('code') \
       .reset_index()

df.insert(2, 'cnt', l.values)
print (df)
    ID  ADM_ID  cnt  code1  code2 code3  code4  code5  code6  code7  code8  \
0  108  183350   11   7100   5849  5780  99811   4466  40301  58281  E8798   
1  109  108375    6   7100  40301  5856  58381   3643  28521   None   None   
2  109  128755    6   4372  78039  7100  40391   4251   2859   None   None   
3  109  164029    6  40301   7100  5856  V4983  58381   3643   None   None   
4  109  193281   16  40301   5856  7100   7907   4254  99662  99812  36001   

   code9 code10 code11 code12 code13 code14 code15  code16  
0  58889   4430  78659   None   None   None   None    None  
1   None   None   None   None   None   None   None    None  
2   None   None   None   None   None   None   None    None  
3   None   None   None   None   None   None   None    None  
4  11289  V5865   7821  28521  37900  37632  37005  36400   
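
The desired output in the question shows blank cells where this result has None. The following is a rough sketch of the same approach with a `fillna("")` step added, assuming an illustrative subset of the data; the names `lists`, `counts` and `wide` are chosen here for readability and are not from the answer.

```python
import pandas as pd

# Illustrative subset of the question's data; codes are kept as strings.
df = pd.DataFrame({
    "ID": [108, 108, 108, 109, 109],
    "ADM_ID": [183350, 183350, 183350, 128755, 128755],
    "code": ["7100", "5849", "E8798", "4372", "78039"],
})

# One list of codes per (ID, ADM_ID) group, plus the length of each list.
lists = df.groupby(["ID", "ADM_ID"])["code"].apply(list)
counts = lists.str.len()

wide = (
    pd.DataFrame(lists.tolist(), index=lists.index,
                 columns=range(1, counts.max() + 1))
      .add_prefix("code")
      .fillna("")            # blank out the padding for shorter groups
      .reset_index()
)
wide.insert(2, "cnt", counts.values)
print(wide)
```

Filling with empty strings is only a display choice; keeping the missing values is usually easier to work with if the codes are processed further.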

Answer 2 (score: 0)

Method 1

Using groupby

Doing this may help:
df2 = df.groupby(['ID', 'ADM_ID'])['code'].agg([np.count_nonzero,
                                                lambda x: tuple(x)])
df3 = pd.concat([df2.reset_index(),
                 pd.DataFrame(df2['<lambda>'].tolist())],
                axis=1)
del df3['<lambda>']
cols = ['ID', 'ADM_ID', 'cnt']
cols.extend(['code'+str(i) for i in range(1, len(df3.columns)-2)])
df3.columns = cols
df3
Out[52]: 
    ID  ADM_ID  cnt  code1  code2 code3  code4  code5  code6  code7  code8  \
0  108  183350   11   7100   5849  5780  99811   4466  40301  58281  E8798   
1  109  108375    6   7100  40301  5856  58381   3643  28521   None   None   
2  109  128755    6   4372  78039  7100  40391   4251   2859   None   None   
3  109  164029    6  40301   7100  5856  V4983  58381   3643   None   None   
4  109  193281   16  40301   5856  7100   7907   4254  99662  99812  36001   

   code9 code10 code11 code12 code13 code14 code15 code16  
0  58889   4430  78659   None   None   None   None   None  
1   None   None   None   None   None   None   None   None  
2   None   None   None   None   None   None   None   None  
3   None   None   None   None   None   None   None   None  
4  11289  V5865   7821  28521  37900  37632  37005  36400 

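Edit: Method 2

If you can make the codes fully numeric, which I mention only because your data looks so close to that already, you can use pivot and np.sort instead: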

df2 = df.pivot(index='ADM_ID', columns='code', values='code')
df2.values.sort()
df2.dropna(how='all', axis=1, inplace=True)
df2.columns = ['code'+str(i) for i in range(1, len(df2.columns)+1)]
df2.insert(0, 'cnt', df2.count(axis=1))
df2
Out[71]: 
        cnt   code1   code2   code3    code4    code5    code6    code7  \
ADM_ID                                                                    
108375    6  3643.0  5856.0  7100.0  28521.0  40301.0  58381.0      NaN   
128755    6  2859.0  4251.0  4372.0   7100.0  40391.0  78039.0      NaN   
164029    6  3643.0  4983.0  5856.0   7100.0  40301.0  58381.0      NaN   
183350   11  4430.0  4466.0  5780.0   5849.0   7100.0   8798.0  40301.0   
193281   16  4254.0  5856.0  5865.0   7100.0   7821.0   7907.0  11289.0   

          code8    code9   code10   code11   code12   code13   code14  \
ADM_ID                                                                  
108375      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
128755      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
164029      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
183350  58281.0  58889.0  78659.0  99811.0      NaN      NaN      NaN   
193281  28521.0  36001.0  36400.0  37005.0  37632.0  37900.0  40301.0   

         code15   code16  
ADM_ID                    
108375      NaN      NaN  
128755      NaN      NaN  
164029      NaN      NaN  
183350      NaN      NaN  
193281  99662.0  99812.0  

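With the E and V prefixes stripped, this is somewhat faster on this data (2.63 ms vs 4.84 ms), and about 12 times faster (5.92 ms vs 74.2 ms) when tested on a dataframe with the same number of adm_ids and 100 times as many codes.

Unfortunately, sort on numpy arrays does not seem to play well with NaN in string arrays, and every workaround I have noticed seems to be more expensive than the groupby.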
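
The answer does not show how the E and V prefixes were stripped. One possible way, offered as an assumption rather than as the step actually used, is a regex replacement followed by an integer conversion:

```python
import pandas as pd

# A few codes from the question, including the two prefixed forms.
codes = pd.Series(["7100", "E8798", "V4983", "40301"], name="code")

# Drop a single leading E or V, then convert to integers.
# Caveat: this loses information, since "E8798" and "8798" become the same value.
numeric = codes.str.replace(r"^[EV]", "", regex=True).astype(int)
print(numeric.tolist())
```

This is consistent with the Method 2 output above, where E8798 and V4983 appear as 8798.0 and 4983.0.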

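Answer 3 (score: 0)

Use method chaining to do this in one go. First group by 'ID' and 'ADM_ID', then convert each group's count and values into a list, next expand the lists into columns, add a prefix, rename the count column, and finally reset the index.

This solution automatically handles more or fewer columns.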

(df.groupby(['ID','ADM_ID'])
    .apply(lambda x: [len(x)]+x.code.tolist())
    .apply(pd.Series)
    .add_prefix('code')
    .rename(columns={'code0':'cnt'})
    .reset_index()
)

Out[389]: 
    ID  ADM_ID  cnt  code1  code2 code3  code4  code5  code6  code7  code8  \
0  108  183350   11   7100   5849  5780  99811   4466  40301  58281  E8798   
1  109  108375    6   7100  40301  5856  58381   3643  28521    NaN    NaN   
2  109  128755    6   4372  78039  7100  40391   4251   2859    NaN    NaN   
3  109  164029    6  40301   7100  5856  V4983  58381   3643    NaN    NaN   
4  109  193281   16  40301   5856  7100   7907   4254  99662  99812  36001   

   code9 code10 code11 code12 code13 code14 code15 code16  
0  58889   4430  78659    NaN    NaN    NaN    NaN    NaN  
1    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  
3    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  
4  11289  V5865   7821  28521  37900  37632  37005  36400