I have a df like the one below, and I want to transform it into a new DataFrame:
#   column_1   column_2      column_3   column_4
#   ticket     12345
#   Date       2020-02-01
#   UPC Code   Description   Qty        Unit Price
#   987654     product 1     1          10
#   879756     product 2     1          7
#   987895     product 3     2          5
#   ticket     12346
#   Date       2020-02-03
#   UPC Code   Description   Qty        Unit Price
#   987654     product 1     1          10
#   997651     product 4     1          3
#   ticket     12347
This is an example of the new DataFrame:
#   ticket   date         upc_code   description   qty   unit_price
#   12345    2020-02-01   987654     product 1     1     10
#   12345    2020-02-01   879756     product 2     1     7
#   12345    2020-02-01   987895     product 3     2     5
#   12346    2020-02-03   987654     product 1     1     10
#   12346    2020-02-03   997651     product 4     1     3
#   12347
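Can anyone help me? I'm trying to figure out how to do this. Each ticket value is a purchase order, and the ticket and date values should be repeated for every product on that order. The number of rows below the UPC Code header varies with the number of items purchased.
Thanks in advance!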
Answer 0: (score: 0)
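IIUC, you need to separate the rows that contain blanks (the ticket and Date rows) from the product rows by string matching, then concatenate them back together after some pivoting.
If the blanks are not true null values, you can convert them with the following lines of code.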
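import numpy as np
import pandas as pd

# Turn empty or whitespace-only strings into real NaN values.
# An empty pattern ('') would match every string, so anchor the regex.
df = df.replace(r'^\s*$', np.nan, regex=True)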
print(df)
    column_1     column_2  column_3    column_4
0     ticket        12345       NaN         NaN
1       Date   2020-02-01       NaN         NaN
2   UPC Code  Description       Qty  Unit Price
3     987654    product 1         1          10
4     879756    product 2         1           7
5     987895    product 3         2           5
6     ticket        12346       NaN         NaN
7       Date   2020-02-03       NaN         NaN
8   UPC Code  Description       Qty  Unit Price
9     987654    product 1         1          10
10    997651    product 4         1           3
11    ticket        12347       NaN         NaN
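To see the blank-to-NaN conversion in isolation, here is a minimal sketch on a hypothetical two-row frame (`demo` and `cleaned` are illustrative names, not part of the original data). It assumes the blanks are empty or whitespace-only strings and uses an anchored pattern, `^\s*$`, so that only such cells are replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one cell is empty, one holds only spaces
demo = pd.DataFrame({"column_1": ["ticket", "Date"], "column_3": ["", "   "]})

# ^\s*$ matches strings that are empty or whitespace-only, and nothing else
cleaned = demo.replace(r"^\s*$", np.nan, regex=True)

print(cleaned["column_3"].isna().all())  # True when both blank cells became NaN
print(cleaned["column_1"].tolist())      # non-blank text is left untouched
```

If the blanks in the real data are already NaN (for example after `pd.read_csv`), this step can be skipped.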
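# Product rows: fully populated rows, excluding the repeated "UPC Code" header rows
s = df.dropna(how='any').loc[~df["column_1"].str.contains("UPC Code")]
# Metadata rows: everything else (the ticket and Date rows)
s1 = df[~df.index.isin(s.index) & ~df["column_1"].str.contains("UPC Code")]
df2 = pd.concat(
    [
        # Pivot the labels in column_1 ("Date", "ticket") into columns holding column_2's values
        pd.crosstab(s1.index, s1["column_1"], s1["column_2"], aggfunc="first")
        .ffill()
        .bfill()
        .reset_index(drop=True),
        s.reset_index(drop=True),
    ],
    # Join side by side; after reset_index the two frames are paired by row position
    axis=1,
)
print(df2)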
         Date  ticket  column_1   column_2  column_3  column_4
0  2020-02-01   12345    987654  product 1         1        10
1  2020-02-01   12345    879756  product 2         1         7
2  2020-02-01   12346    987895  product 3         2         5
3  2020-02-03   12346    987654  product 1         1        10
4  2020-02-03   12347    997651  product 4         1         3
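Note that the positional concat pairs the metadata rows and the product rows purely by row order, which is why product 3 ends up under ticket 12346 in the output above rather than 12345 as in the desired result. A possible alternative, sketched below under stated assumptions, carries the ticket and date down to the rows beneath them with a forward fill instead. `flatten_tickets` is a hypothetical helper name; the sketch assumes the marker rows are exactly `ticket`, `Date` and `UPC Code` in `column_1`, that blanks are empty strings, and that all values are strings.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question; blanks are assumed to be empty strings
rows = [
    ["ticket", "12345", "", ""],
    ["Date", "2020-02-01", "", ""],
    ["UPC Code", "Description", "Qty", "Unit Price"],
    ["987654", "product 1", "1", "10"],
    ["879756", "product 2", "1", "7"],
    ["987895", "product 3", "2", "5"],
    ["ticket", "12346", "", ""],
    ["Date", "2020-02-03", "", ""],
    ["UPC Code", "Description", "Qty", "Unit Price"],
    ["987654", "product 1", "1", "10"],
    ["997651", "product 4", "1", "3"],
    ["ticket", "12347", "", ""],
]
df = pd.DataFrame(rows, columns=["column_1", "column_2", "column_3", "column_4"])
df = df.replace(r"^\s*$", np.nan, regex=True)

COLS = ["ticket", "date", "upc_code", "description", "qty", "unit_price"]


def flatten_tickets(df):
    """Hypothetical helper: one output row per product, with ticket and date repeated."""
    is_ticket = df["column_1"].eq("ticket")
    is_date = df["column_1"].eq("Date")
    is_header = df["column_1"].eq("UPC Code")

    out = df.copy()
    # Keep the ticket number only on ticket rows, then carry it down to the rows below
    out["ticket"] = df["column_2"].where(is_ticket).ffill()
    # Carry the date down within each ticket only, so a ticket
    # without a Date row (12347 here) keeps an empty date
    out["date"] = df["column_2"].where(is_date).groupby(out["ticket"]).ffill()

    # Product rows are whatever is left after dropping the three kinds of marker rows
    products = out[~(is_ticket | is_date | is_header)].rename(
        columns={
            "column_1": "upc_code",
            "column_2": "description",
            "column_3": "qty",
            "column_4": "unit_price",
        }
    )[COLS]

    # Tickets with no product rows would otherwise disappear; add them back as empty rows
    missing = out.loc[is_ticket & ~out["ticket"].isin(products["ticket"]), ["ticket"]]
    return pd.concat([products, missing], ignore_index=True)[COLS]


result = flatten_tickets(df)
print(result)
```

Because each row takes its ticket from the nearest ticket row above it, the result does not depend on the two groups happening to have the same number of rows. The values stay as strings; `pd.to_numeric` or `pd.to_datetime` could be applied afterwards if typed columns are needed.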