I have the following pandas dataframe:
df1
code prod rsp date_from date_to time_from time_to
123 MS 75 2018-01-01 2018-01-02 06:00 05:59
123 HS 65 2018-01-01 2018-01-02 06:00 05:59
123 MS 76 2018-01-01 2018-01-02 10:00 05:59
123 MS 76 2018-01-01 2018-01-02 11:00 05:59
123 MS 73 2018-01-02 2018-01-03 06:00 05:59
123 HS 64 2018-01-02 2018-01-03 06:00 05:59
123 MS 73 2018-01-02 2018-01-03 10:00 05:59
The dataframe I want is:
code prod rsp_1 date_from date_to time_from_1 time_to_1 rsp_2 time_from_2 time_to_2
123 MS 75 2018-01-01 2018-01-02 06:00 05:59 76 10:00 05:59
123 HS 65 2018-01-01 2018-01-02 06:00 05:59 - - -
123 MS 73 2018-01-02 2018-01-03 06:00 05:59 - - -
123 HS 64 2018-01-02 2018-01-03 06:00 05:59 - - -
This is what I am currently doing in Python:
import pandas as pd

# price is the input dataframe shown above
L = list(map(tuple,price[['code','prod','date_from']].values))
s = pd.Series(L, index=price.index)
s = s.ne(s.shift()).cumsum()
g = s.groupby(s).cumcount()
df1 = (price.set_index(['code','prod','date_from', s,g])
.unstack()
.sort_index(level=1, axis=1)
.reset_index(level=2, drop=True))
df1.columns = [f'{i}_{j+1}' for i, j in df1.columns]
df1 = df1.reset_index()
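I want only the unique prices (rsp) to end up in the columns. For example, in df1, for prod MS and date_from 2018-01-01, there are two duplicate entries with rsp 76, so only the first entry should be considered. As a result, each product has one row per date, together with the corresponding price change history.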
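To make the rule concrete, here is a minimal sketch of which rows count as repeats. It assumes the data is held in a frame named `price` (the name used in the code above) and that a repeat means the same `code`, `prod`, `date_from` and `rsp`; that choice of key columns is my assumption, not something the data itself dictates.

```python
import pandas as pd

# Sample data copied from the table above.
price = pd.DataFrame({
    "code": [123] * 7,
    "prod": ["MS", "HS", "MS", "MS", "MS", "HS", "MS"],
    "rsp": [75, 65, 76, 76, 73, 64, 73],
    "date_from": ["2018-01-01"] * 4 + ["2018-01-02"] * 3,
    "date_to": ["2018-01-02"] * 4 + ["2018-01-03"] * 3,
    "time_from": ["06:00", "06:00", "10:00", "11:00", "06:00", "06:00", "10:00"],
    "time_to": ["05:59"] * 7,
})

# True for every row whose price was already seen for the same
# code / prod / date_from (the first occurrence stays False).
is_repeat = price.duplicated(subset=["code", "prod", "date_from", "rsp"])

print(price[is_repeat])
```

With the sample data, the flagged rows are the 11:00 entry for rsp 76 and the 10:00 entry for rsp 73, which are exactly the ones missing from the desired output.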
Answer 0 (score: 1)
Use drop_duplicates first; after that, it seems the solution can be simplified:
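# drop duplicates by a single column
# (note: this compares rsp across the whole frame, not per product/date)
price = price.drop_duplicates('rsp')
# if necessary, drop duplicates by multiple columns instead
#cols = ['code','prod','date_from', 'date_to', 'rsp']
#price = price.drop_duplicates(subset=cols)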
g = price.groupby(['code','prod','date_from', 'date_to']).cumcount()
df1 = (price.set_index(['code','prod','date_from','date_to', g])
.unstack()
.sort_index(level=1, axis=1))
df1.columns = [f'{i}_{j+1}' for i, j in df1.columns]
df1 = df1.reset_index()
print(df1)
code prod date_from date_to rsp_1 time_from_1 time_to_1 rsp_2 \
0 123 HS 2018-01-01 2018-01-02 65.0 06:00 05:59 NaN
1 123 HS 2018-01-02 2018-01-03 64.0 06:00 05:59 NaN
2 123 MS 2018-01-01 2018-01-02 75.0 06:00 05:59 76.0
3 123 MS 2018-01-02 2018-01-03 73.0 06:00 05:59 NaN
time_from_2 time_to_2
0 NaN NaN
1 NaN NaN
2 10:00 05:59
3 NaN NaN
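For reference, here is a self-contained sketch of the same approach that can be run as-is. The sample frame is rebuilt by hand from the question, and the names `keys`, `dedup` and `wide` are only illustrative. It uses the multi-column variant of `drop_duplicates`, on the assumption that an identical price on a different product or date should be kept; if that assumption does not hold for your data, the single-column call may be what you want.

```python
import pandas as pd

# Sample data copied from the question.
price = pd.DataFrame({
    "code": [123] * 7,
    "prod": ["MS", "HS", "MS", "MS", "MS", "HS", "MS"],
    "rsp": [75, 65, 76, 76, 73, 64, 73],
    "date_from": ["2018-01-01"] * 4 + ["2018-01-02"] * 3,
    "date_to": ["2018-01-02"] * 4 + ["2018-01-03"] * 3,
    "time_from": ["06:00", "06:00", "10:00", "11:00", "06:00", "06:00", "10:00"],
    "time_to": ["05:59"] * 7,
})

# Columns that identify one output row.
keys = ["code", "prod", "date_from", "date_to"]

# Keep only the first occurrence of each price within a product/date group.
dedup = price.drop_duplicates(subset=keys + ["rsp"])

# Number the remaining price changes inside each group: 0, 1, 2, ...
g = dedup.groupby(keys).cumcount()

# Move that counter into the columns, then group columns by counter.
wide = (dedup.set_index(keys + [g])
             .unstack()
             .sort_index(level=1, axis=1))

# Flatten the (name, counter) column pairs into names like rsp_1, rsp_2.
wide.columns = [f"{name}_{n + 1}" for name, n in wide.columns]
wide = wide.reset_index()

print(wide)
```

For this sample, the result should contain the same four rows as the output above, with `rsp_2`, `time_from_2` and `time_to_2` filled only for MS on 2018-01-01.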