Python 3.4 - Pandas - 帮助正确排列数据帧列和删除无效列

时间:2018-02-28 01:51:05

标签: python python-3.x pandas dataframe group-by

此问题基于我之前提到过的Python - Pandas - Combining rows of multiple columns into single row in dataframe based on categorical value

我有一个表格格式如下:

                  Var1     Var2      Var3      Var4    ID

          0    0.70089  0.93120  1.867650  0.658020    1

          1    0.15893 -0.74950  1.089150 -0.045123    1

          2    0.13690  0.59210 -0.032990  0.672860    1

          3   -0.50136  0.89913  0.440200  0.812150    1

          4    1.08940  0.43036  0.669470  1.286000    1

          5    0.09310  0.14979 -0.392335  0.040500    1

          6  7  0.63339  1.27161  0.852072  0.474800   2

          7  8 -0.54944 -0.04547  0.867050 -0.234800   2

          8  9  1.28600  1.87650  0.976670  0.440200   2

我使用以下代码创建了上表:

import pandas as pd
df1 = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286], Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765], 'Var3': [1.86765, 1.08915,-0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667], 'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402] 'ID':[1, 1, 1, 1, 1, 1, 2, 2, 2]}

df=pd.Dataframe(data=df1)

我希望根据列ID'将其分组为特定格式。

所需的输出结构与下表相似:

ID   V1_0_0 V2_0_1  V3_0_2  V4_0_3  V1_1_0  V2_1_1  V3_1_2  V4_1_3    
 1     A       B       C       D      E       F      G        H
 2     I       J       K       L      0       0      0        0

我是在上面引用的最后一个问题的用户 Allen 的帮助下实现的。代码印在下面:

num_V = 4
max_row = df.groupby('ID').ID.count().max()

df= df.groupby('ID').apply(lambda x: x.values[:,1:].reshape(1,-1)
[0].apply(lambda x: x.values[:,1:].reshape(1,-1)[0]).apply(pd.Series) 
.fillna(0)





df.columns = ['V{}_{}_{}'.format(i+1,j,i) for j in range(max_row) for i in 
              range(num_V)]
print(df)

其结果产生以下输出表:

          V1_0_0    V2_0_1   V3_0_2  ***V4_0_3**   V1_1_0   V2_1_1  V3_1_2  \

ID

1         0.93120   1.867650  0.65802    1      -0.74950  1.08915 -0.045123

2         1.27161     0.852072  0.47480   2     -0.04547  0.86705  -0.234800



       **V4_1_3***  V1_2_0   V2_2_1   ...V3_3_2  **V4_3_3**  V1_4_0   V2_4_1  \

ID                                     ...

1         1       0.5921 -0.03299   ...    0.81215    1    0.43036  0.66947

2         2      1.8765  0.97667      ...    0.00000  0    0.00000  0.00000



            V3_4_2  **V4_4_3**   V1_5_0    V2_5_1  V3_5_2  **V4_5_3**

ID

  1          1.286     1        0.14979 -0.392335  0.0405       1

  2          0.000     0        0.00000  0.000000  0.0000       0

这是部分正确的,但问题是某些列在每3列(** **之间的列)之后给出值1和2。 然后在没有与ID'相关的值之后打印1和0。价值2。 在检查之后,我意识到它不是在打印" Var1"值,并且值由一列关闭。 (即V1_0_0应为0.70089,V4_0_3的实际值应为V3_0_2的值,等于0.65802。

有没有办法纠正这个问题,以便得到与我想要的输出表完全相同的东西?如何确保** **标记的列删除它们具有的值并返回正确的值?

我正在使用 Python 3.4 Linux终端上

运行它

Desired Output

感谢。

1 个答案:

答案 0 :(得分:1)

不确定您提供的代码有什么问题,但请尝试一下,让我知道它是否能满足您的需求:

     import pandas as pd

    df = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286], 'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765], 'Var3': [1.86765, 1.08915,-0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667], 'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402], 'ID':[1, 1, 1, 1, 1, 1, 2, 2, 2]}
    df=pd.DataFrame(df)
    newdataframe=pd.DataFrame(columns=df.columns)
    newID=[]

    for agroup in df.ID.unique():
        temp_df=pd.DataFrame(columns=df.columns)
        adf=df[df.ID==agroup]
        for aline in adf.itertuples():
            a= ((pd.DataFrame(list(aline))).T).drop(columns=[0])
            a.columns=df.columns
            if a.ID.values[0] not in newID:
                 suffix_count=1
                 temp_df=pd.concat([temp_df,a])
                 newID.append(a.ID.values[0])
            else:
                 temp_df = temp_df.merge(a, how='outer', on='ID', suffixes=('', '_'+ str(suffix_count)))
                 suffix_count += 1
                 newdataframe=pd.concat([newdataframe,temp_df])

    print (newdataframe)

输出:

  ID     Var1   Var1_1  Var1_2   Var1_3  Var1_4  Var1_5     Var2   Var2_1  \

 0  1.0  0.70089  0.15893  0.1369 -0.50136  1.0894  0.0931  0.93120 -0.74950   
 0  2.0  0.63339 -0.54944  1.2860      NaN     NaN     NaN  1.27161 -0.04547   

        Var2_2   ...     Var3_2  Var3_3   Var3_4    Var3_5     Var4    Var4_1  \
     0  0.5921   ...   -0.03299  0.4402  0.66947 -0.392335  0.65802    -0.045123   
     0  1.8765   ...    0.97667     NaN      NaN       NaN  0.47480 -0.234800   

        Var4_2   Var4_3  Var4_4  Var4_5  
     0  0.67286  0.81215   1.286  0.0405  
     0  0.44020      NaN     NaN     NaN  

另一个用于实现您要查找的输出的代码:

            import pandas as pd
            import numpy as np
            import re

            df = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286], 'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765], 'Var3': [1.86765, 1.08915,-0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667], 'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402], 'ID':[1, 1, 1, 1, 1, 1, 2, 2, 2]}
            df=pd.DataFrame(df)
            df['duplicateID']=df['ID'].duplicated()
            newdf=df[df['duplicateID']==False]
            newdf=newdf.reset_index()
            newdf=newdf.iloc[:,1:]
            df=df[df['duplicateID']==True]
            df=df.reset_index()
            df=df.iloc[:,1:]
            del newdf['duplicateID']
            del df['duplicateID']

            merge_count=0
            newID=[]


           for aline in df.itertuples():
                a= ((pd.DataFrame(list(aline))).T).drop(columns=[0])
                a.columns=df.columns
                newdf=newdf.merge(a, how='left', on ='ID', suffixes=('_'+str(merge_count),'_'+str(merge_count+1)))
                merge_count+=1

         newdf.index=newdf['ID']
        del newdf['ID']
        newdf.columns=[col+'_'+str(int(re.findall('\d+',col)[0])-1) for col in newdf.columns]
        print newdf