Splitting multiple tables stacked in one pandas DataFrame

Time: 2020-02-06 12:53:45

Tags: python pandas

I have loaded a CSV that looks like this:

  Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3
        Date      Price     Quantity     Colour
        2019        45         56         Blue
        2019        34         65         Red
        Date      Price     Quantity     Size
        2019        12         32         Large
        2019        45         11         Small
        Date      Price     Country     
        2019        12         UK        
        2019        45         US         

I want to turn it into one complete DataFrame, so the expected output is:

    Date      Price     Quantity     Colour  Size   Country
    2019        45         56         Blue   NaN    NaN
    2019        34         65         Red    NaN    NaN
    2019        12         32         NaN    Large  NaN
    2019        45         11         NaN    Small  NaN
    2019        12         NaN        NaN    NaN    UK
    2019        45         NaN        NaN    NaN    US

The word Date in the first column always marks where a new table starts.

2 answers:

Answer 0: (score: 4)

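If the Date string can be used to tell the groups apart, compare the first column against it and apply Series.cumsum to get a group label per row. Pass those labels to groupby, and in a custom function set the column names from the first row of each group, then drop that header row along with any columns that are entirely NaN: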

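# Running count of 'Date' rows: every row of the same sub-table gets the same label
g = df.iloc[:, 0].eq('Date').cumsum()

def f(x):
    # Work on a copy so the group passed in by groupby is not modified in place
    x = x.copy()
    # The first row of each group holds that sub-table's column names
    x.columns = x.iloc[0]
    # Drop the header row and any columns that are entirely NaN
    return x.iloc[1:].dropna(how='all', axis=1)

df = df.groupby(g).apply(f).reset_index(drop=True)
print(df)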
   Date Price Quantity Colour   Size
0  2019    45       56   Blue    NaN
1  2019    34       65    Red    NaN
2  2019    12       32    NaN  Large
3  2019    45       11    NaN  Small
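
To make the grouping key `g` concrete, here is a minimal sketch that rebuilds the frame from the question by hand and prints the labels. The `rows` list is a hypothetical reconstruction of the CSV contents, and it assumes all values were read as strings with the empty cells as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the frame shown in the question
rows = [
    ["Date", "Price", "Quantity", "Colour"],
    ["2019", "45", "56", "Blue"],
    ["2019", "34", "65", "Red"],
    ["Date", "Price", "Quantity", "Size"],
    ["2019", "12", "32", "Large"],
    ["2019", "45", "11", "Small"],
    ["Date", "Price", "Country", np.nan],
    ["2019", "12", "UK", np.nan],
    ["2019", "45", "US", np.nan],
]
df = pd.DataFrame(rows, columns=[f"Unnamed: {i}" for i in range(4)])

# True on every header row; the cumulative sum increases by one at each
# header, so all rows of one sub-table share the same label
g = df.iloc[:, 0].eq("Date").cumsum()
print(g.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Because the labels only change on a header row, this works regardless of how many data rows each sub-table has.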

Answer 1: (score: 1)

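I would use DataFrame.groupby with np.arange, together with pd.concat. This assumes every table spans exactly three rows (one header row plus two data rows):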

import numpy as np
import pandas as pd

# Rows are labelled in blocks of three: one header row plus two data rows
new_df = (pd.concat([pd.DataFrame(columns=group.iloc[0],
                                  index=group.index[1:],
                                  data=group.iloc[1:].values)
                     for _, group in df.groupby(np.arange(len(df)) // 3)],
                    sort=False)
            .dropna(how='all', axis=1))
print(new_df)
   Date Price Quantity Colour   Size Country
1  2019    45       56   Blue    NaN     NaN
2  2019    34       65    Red    NaN     NaN
4  2019    12       32    NaN  Large     NaN
5  2019    45       11    NaN  Small     NaN
7  2019    12      NaN    NaN    NaN      UK
8  2019    45      NaN    NaN    NaN      US
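
As a quick illustration of why the integer division works as a grouping key, the sketch below prints the labels it produces for a nine-row frame. The names `n_rows` and `block_size` are introduced here only for illustration.

```python
import numpy as np

n_rows = 9       # number of rows in the frame from the question
block_size = 3   # assumed: one header row plus two data rows per table

# Floor division maps row positions 0-2 to 0, 3-5 to 1, and 6-8 to 2
labels = np.arange(n_rows) // block_size
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The labels depend only on row position, so this key is valid only while every table has the same number of rows.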

If the frames to be concatenated do not all have the same length, then I would use Series.cumsum instead:

# The header rows contain 'Date' (capitalised), so compare against that exact string
new_df = (pd.concat([pd.DataFrame(columns=group.iloc[0],
                                  index=group.index[1:],
                                  data=group.iloc[1:].values)
                     for _, group in df.groupby(df.iloc[:, 0].eq('Date').cumsum())],
                    sort=False)
            .dropna(how='all', axis=1))
print(new_df)
   Date Price Quantity Colour   Size Country
1  2019    45       56   Blue    NaN     NaN
2  2019    34       65    Red    NaN     NaN
4  2019    12       32    NaN  Large     NaN
5  2019    45       11    NaN  Small     NaN
7  2019    12      NaN    NaN    NaN      UK
8  2019    45      NaN    NaN    NaN      US
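
For an end-to-end check, the cumsum-based split can be wrapped in a small function and run on a CSV string. This is only a sketch: `split_stacked_tables` and `csv_text` are hypothetical names, the CSV text is a reconstruction of the question's data, and the function assumes the very first row of the frame is a header row.

```python
import io

import pandas as pd


def split_stacked_tables(df, marker="Date"):
    """Split a frame of vertically stacked tables and combine them.

    Assumes each sub-table starts with a header row whose first cell
    equals `marker`, and that the first row of `df` is such a header.
    """
    key = df.iloc[:, 0].eq(marker).cumsum()
    parts = []
    for _, group in df.groupby(key):
        header = group.iloc[0]
        body = group.iloc[1:]
        # Keep only the columns that carry a header label in this block
        keep = header.notna().to_numpy()
        part = body.loc[:, keep].copy()
        part.columns = header[keep].tolist()
        parts.append(part)
    # ignore_index=True renumbers the rows 0..n-1 in the combined frame
    return pd.concat(parts, ignore_index=True, sort=False)


# Hypothetical reconstruction of the CSV from the question
csv_text = """Date,Price,Quantity,Colour
2019,45,56,Blue
2019,34,65,Red
Date,Price,Quantity,Size
2019,12,32,Large
2019,45,11,Small
Date,Price,Country,
2019,12,UK,
2019,45,US,
"""
raw = pd.read_csv(io.StringIO(csv_text), header=None)
result = split_stacked_tables(raw)
print(result)
```

Passing `ignore_index=True` to pd.concat yields a fresh 0-based index rather than the original row numbers. All values stay strings here, so a numeric conversion (for example with pd.to_numeric) would be a separate step.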