我已经加载了一个看起来像这样的csv
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
Date Price Quantity Colour
2019 45 56 Blue
2019 34 65 Red
Date Price Quantity Size
2019 12 32 Large
2019 45 11 Small
Date Price Country
2019 12 UK
2019 45 US
我想把它变成一个完整的数据帧,所以预期的输出是这样;
Date Price Quantity Colour Size Country
2019 45 56 Blue NaN NaN
2019 34 65 Red NaN NaN
2019 12 32 NaN Large NaN
2019 45 11 NaN Small NaN
2019 12 32 NaN NaN UK
2019 45 11 NaN NaN US
第一列中的日期一词将始终拆分表
答案 0 :(得分:4)
如果可以使用Date
字符串来区分每个组,则比较并添加Series.cumsum
,传递给groupby
,并在自定义函数中按第一行创建列,并按{{3 }}:
g = df.iloc[:, 0].eq('Date').cumsum()
def f(x):
x.columns = x.iloc[0]
return x.iloc[1:].dropna(how='all', axis=1)
df = df.groupby(g).apply(f).reset_index(drop=True)
print (df)
Date Price Quantity Colour Size
0 2019 45 56 Blue NaN
1 2019 34 65 Red NaN
2 2019 12 32 NaN Large
3 2019 45 11 NaN Small
答案 1 :(得分:1)
我会使用np.arange
来使用DataFrame.groupby
和pd.concat
:
new_df = (pd.concat([pd.DataFrame(columns = group.iloc[0],
index =group.index[1:],
data = group.iloc[1:].values)
for i,group in df.groupby(np.arange(len(df))//3)],sort = False)
.dropna(how = 'all',axis = 1))
print(new_df)
Date Price Quantity Colour Size Country
1 2019 45 56 Blue NaN NaN
2 2019 34 65 Red NaN NaN
4 2019 12 32 NaN Large NaN
5 2019 45 11 NaN Small NaN
7 2019 12 NaN NaN NaN UK
8 2019 45 NaN NaN NaN US
如果要连接的每一帧的长度都不相同,那么我将使用Series.cumsum
new_df = (pd.concat([pd.DataFrame(columns = group.iloc[0],
index =group.index[1:],
data = group.iloc[1:].values)
for i,group in df.groupby(df.iloc[:,0].eq('date').cumsum())],sort = False)
.dropna(how = 'all',axis = 1))
print(new_df)
0 Date Price Quantity Colour
1 2019 45 56 Blue
2 2019 34 65 Red
3 Date Price Quantity Size
4 2019 12 32 Large
5 2019 45 11 Small
6 Date Price Country NaN
7 2019 12 UK NaN
8 2019 45 US NaN