I have the following DataFrame, which contains several date columns, each paired with a value column:
date        value_1  date        value_2  date        value_3
01-01-1990  1        01-01-1990  2        02-01-1990  4
02-01-1990  3        03-01-1990  20
                     04-01-1990  30
Desired output: merge all date columns into a single superset date column and show each value against its date.
date        value_1  value_2  value_3
01-01-1990  1        2
02-01-1990  3                 4
03-01-1990           20
04-01-1990           30
Answer 0 (score: 3)
First, deduplicate the repeated column names so that each date column can be paired with its value column:
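import numpy as np
import pandas as pd
s = df.columns.to_series()
# True for every column name that occurs more than once
mask = df.columns.duplicated(keep=False)
# append a running counter (_1, _2, ...) to the duplicated names only
c = np.where(mask, s + '_' + (s.groupby(s).cumcount() + 1).astype(str), s)
df.columns = c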
print (df)
date_1 value_1 date_2 value_2 date_3 value_3
0 01-01-1990 1.0 01-01-1990 2 02-01-1990 4.0
1 02-01-1990 3.0 03-01-1990 20 NaN NaN
2 NaN NaN 04-01-1990 30 NaN NaN
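For readers who want to reproduce this step, here is a minimal, self-contained sketch. The sample frame is rebuilt by hand from the question (an assumption about how the data is laid out, with empty cells stored as NaN), and the `suffix` variable is introduced only for readability.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; empty cells are assumed to be NaN.
df = pd.DataFrame(
    [
        ["01-01-1990", 1, "01-01-1990", 2, "02-01-1990", 4],
        ["02-01-1990", 3, "03-01-1990", 20, np.nan, np.nan],
        [np.nan, np.nan, "04-01-1990", 30, np.nan, np.nan],
    ],
    columns=["date", "value_1", "date", "value_2", "date", "value_3"],
)

s = df.columns.to_series()
# Flag every name that appears more than once (here: the three "date" columns).
mask = df.columns.duplicated(keep=False)
# cumcount() numbers the repeats of each name 0, 1, 2, ...; shift to start at 1.
suffix = (s.groupby(s).cumcount() + 1).astype(str)
# Only the flagged names receive a suffix; unique names are kept unchanged.
df.columns = np.where(mask, s + "_" + suffix, s)

print(list(df.columns))
```

Because `value_1`, `value_2` and `value_3` are already unique, only the `date` columns are renamed.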
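Then loop over a groupby with a lambda function that splits the columns into pairs, set the date column of each pair as the index, drop the missing values, and finally concat everything together: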
dfs = [x.set_index(x.columns[0]).dropna()
for i, x in df.groupby(lambda x: x.split('_')[1], axis=1)]
#print (dfs)
df2 = pd.concat(dfs, axis=1)
print (df2)
value_1 value_2 value_3
01-01-1990 1.0 2.0 NaN
02-01-1990 3.0 NaN 4.0
03-01-1990 NaN 20.0 NaN
04-01-1990 NaN 30.0 NaN
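Note that `groupby(..., axis=1)` is deprecated as of pandas 2.1. As a hedged alternative, the same pairing can be sketched by slicing the columns by position. This assumes the columns strictly alternate date, value, date, value; the names `pairs` and the final `sort_index()` call are additions for this illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Frame after the renaming step, rebuilt by hand for a self-contained example.
df = pd.DataFrame(
    [
        ["01-01-1990", 1, "01-01-1990", 2, "02-01-1990", 4],
        ["02-01-1990", 3, "03-01-1990", 20, np.nan, np.nan],
        [np.nan, np.nan, "04-01-1990", 30, np.nan, np.nan],
    ],
    columns=["date_1", "value_1", "date_2", "value_2", "date_3", "value_3"],
)

# Walk the columns two at a time: (date_1, value_1), (date_2, value_2), ...
pairs = [df.iloc[:, i:i + 2] for i in range(0, df.shape[1], 2)]

# Use each pair's date column as the index and drop rows without a value.
dfs = [p.set_index(p.columns[0]).dropna() for p in pairs]

# Column-wise concat aligns on the union of all dates.
# The dates are still strings here, so sort_index() sorts them as text.
df2 = pd.concat(dfs, axis=1).sort_index()
print(df2)
```

If chronological ordering matters for real data, converting the date columns with `pd.to_datetime` before setting the index is probably the safer choice.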
Edit:
The answer was changed for the case of a datetime column followed by 2 value columns:
print (df)
date_security GH_LAST_PRICE Val GH_VOLUME_PRICE Val date_security \
0 01-01-1990 1.0 7.0 01-01-1990
1 01-02-1990 3.0 8.0 03-01-1990
2 NaN NaN NaN 04-01-1990
DG_LAST_PRICE Val DG_VOLUME_PRICE Val
0 2 10.0
1 20 NaN
2 30 1.0
Create a MultiIndex:
df.columns = [(np.arange(len(df.columns)) // 3).astype(str), df.columns]
print (df)
# 0 1 \
date_security GH_LAST_PRICE Val GH_VOLUME_PRICE Val date_security
0 01-01-1990 1.0 7.0 01-01-1990
1 01-02-1990 3.0 8.0 03-01-1990
2 NaN NaN NaN 04-01-1990
DG_LAST_PRICE Val DG_VOLUME_PRICE Val
0 2 10.0
1 20 NaN
2 30 1.0
dfs = [x.set_index(x.columns[0]).dropna()
for i, x in df.groupby(level=0, axis=1)]
df2 = pd.concat(dfs, axis=1)
#flatten MultiIndex
df2.columns = df2.columns.map('_'.join)
print (df2)
0_GH_LAST_PRICE Val 0_GH_VOLUME_PRICE Val 1_DG_LAST_PRICE Val \
01-01-1990 1.0 7.0 2.0
01-02-1990 3.0 8.0 NaN
04-01-1990 NaN NaN 30.0
1_DG_VOLUME_PRICE Val
01-01-1990 10.0
01-02-1990 NaN
04-01-1990 1.0
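In the output above, 03-01-1990 is missing because `dropna()` with no arguments drops a row as soon as any of its values is NaN. The sketch below is a variant, not the original answer's code: it slices the columns in blocks of three by position instead of building a MultiIndex, and it uses `dropna(how="all")` so that partially filled rows survive. The frame is rebuilt by hand from the printed example, and `width` is a name chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Frame rebuilt by hand from the printed example in the edit.
df = pd.DataFrame(
    [
        ["01-01-1990", 1.0, 7.0, "01-01-1990", 2, 10.0],
        ["01-02-1990", 3.0, 8.0, "03-01-1990", 20, np.nan],
        [np.nan, np.nan, np.nan, "04-01-1990", 30, 1.0],
    ],
    columns=[
        "date_security", "GH_LAST_PRICE Val", "GH_VOLUME_PRICE Val",
        "date_security", "DG_LAST_PRICE Val", "DG_VOLUME_PRICE Val",
    ],
)

width = 3  # assumed layout: one date column followed by two value columns
dfs = []
for start in range(0, df.shape[1], width):
    block = df.iloc[:, start:start + width]
    block = block.set_index(block.columns[0])
    # how="all" removes a row only when every value in it is missing.
    dfs.append(block.dropna(how="all"))

df2 = pd.concat(dfs, axis=1).sort_index()
print(df2)
```

Whether partially filled rows should be kept depends on the data, so treat the choice between `dropna()` and `dropna(how="all")` as a decision for the reader.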
Answer 1 (score: 0)
One approach is to stack the date/value pairs vertically into a single DataFrame:
df.columns = ['date_1', 'value_1', 'date_2', 'value_2', 'date_3', 'value_3']
# stack the three pairs under a shared 'date' column, then renumber the rows
new_df = pd.concat([
    df[['date_1', 'value_1']].rename(columns={'date_1': 'date'}),
    df[['date_2', 'value_2']].rename(columns={'date_2': 'date'}),
    df[['date_3', 'value_3']].rename(columns={'date_3': 'date'}),
]).dropna(how='all').reset_index(drop=True)
print(new_df)
date value_1 value_2 value_3
0 01-01-1990 1.0 NaN NaN
1 02-01-1990 3.0 NaN NaN
2 01-01-1990 NaN 2.0 NaN
3 03-01-1990 NaN 20.0 NaN
4 04-01-1990 NaN 30.0 NaN
5 02-01-1990 NaN NaN 4.0
Then group by date, filling each group forward and backward and dropping the duplicate rows:
new_df.groupby('date',as_index=False).apply(lambda x:x.ffill().bfill().drop_duplicates())
date value_1 value_2 value_3
0 0 01-01-1990 1.0 2.0 NaN
1 1 02-01-1990 3.0 NaN 4.0
2 3 03-01-1990 NaN 20.0 NaN
3 4 04-01-1990 NaN 30.0 NaN
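A possibly simpler way to collapse the stacked frame, offered as a sketch and not as the answerer's method, is `GroupBy.first()`, which returns the first non-null entry of each column within a group. The sample frame and the `parts` and `result` names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Frame with already deduplicated column names, rebuilt by hand.
df = pd.DataFrame(
    [
        ["01-01-1990", 1, "01-01-1990", 2, "02-01-1990", 4],
        ["02-01-1990", 3, "03-01-1990", 20, np.nan, np.nan],
        [np.nan, np.nan, "04-01-1990", 30, np.nan, np.nan],
    ],
    columns=["date_1", "value_1", "date_2", "value_2", "date_3", "value_3"],
)

# Stack each (date_i, value_i) pair under a shared "date" column.
parts = [
    df[[f"date_{i}", f"value_{i}"]].rename(columns={f"date_{i}": "date"})
    for i in (1, 2, 3)
]
new_df = pd.concat(parts, ignore_index=True).dropna(how="all")

# first() keeps the first non-null value per column within each date group.
result = new_df.groupby("date", as_index=False).first()
print(result)
```

This sidesteps the `apply` with `ffill().bfill().drop_duplicates()`, at the cost of silently keeping only the first value if one date carries two different values in the same column.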