我有pandas DataFrame
这样:
df = pd.DataFrame({'custid':[1,2,3,4],
...: 'prod1':['jeans','tshirt','jacket','tshirt'],
...: 'prod1_hnode1':[1,2,3,2],
...: 'prod1_hnode2':[6,7,8,7],
...: 'prod2':['tshirt','jeans','jacket','shirt'],
...: 'prod2_hnode1':[2,1,3,4],
...: 'prod2_hnode2':[7,6,8,7]})
In [54]: df
Out[54]:
custid prod1 prod1_hnode1 prod1_hnode2 prod2 prod2_hnode1 \
0 1 jeans 1 6 tshirt 2
1 2 tshirt 2 7 jeans 1
2 3 jacket 3 8 jacket 3
3 4 tshirt 2 7 shirt 4
prod2_hnode2
0 7
1 6
2 8
3 7
如何将其转换为以下格式:
dfnew = pd.DataFrame({'custid':[1,1,2,2,3,3,4,4],
...: 'prod':['prod1','prod2','prod1','prod2','prod1','prod2','prod1','prod2'],
...: 'rec':['jeans','tshirt','tshirt','jeans','jacket','jacket','tshirt','shirt'],
...: 'hnode1':[1,2,2,1,3,3,2,4],
...: 'hnode2':[6,7,7,6,8,8,7,7]})
In [56]: dfnew
Out[56]:
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
答案 0 :(得分:4)
使用:
set_index
列custid
split
MultiIndex
NaN
rec
第一级stack
reset_index
来自MultiIndex
df = df.set_index('custid')
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns={np.nan:'rec'})
cols = ['custid','hnode1','hnode2','prod','rec']
df = df.stack(0).reset_index().rename(columns={'level_1':'prod'}).reindex(columns=cols)
print (df)
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
答案 1 :(得分:1)
这是另一种应该有效的方法,但使用了重复melt
s。
coln = df.dtypes.index # save some typing
df_long = pd.melt(
df, id_vars = "custid", value_vars = ["prod1", "prod2"],
var_name = "prod", value_name = "rec").assign(
hnode1 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode1" in x, coln))["value"],
hnode2 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode2" in x, coln))["value"])
print(df_long)
custid prod rec hnode1 hnode2
0 1 prod1 jeans 1 6
1 2 prod1 tshirt 2 7
2 3 prod1 jacket 3 8
3 4 prod1 tshirt 2 7
4 1 prod2 tshirt 2 7
5 2 prod2 jeans 1 6
6 3 prod2 jacket 3 8
7 4 prod2 shirt 4 7
你在评论中提到R. melt
来自" data.table"应该能够更轻松地处理这个问题,因为您可以同时融合多组列,类似于您使用基本R reshape
函数解决问题的方式。
基础R方法可能类似于:
reshape(df, direction = "long", idvar = "custid",
varying = list(c(2, 5), c(3, 6), c(4, 7)),
sep = "", times = c("prod1", "prod2"))