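I already have a solution that produces the expected result, but it seems to me that it is far from optimal.
Here is the situation:
Two different Pandas DataFrames are given, each with timestamps as its index (taken from synchronized clocks). The tables below illustrate them and introduce the names used: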
Table 1
+-----+------+------+-----+------+
| ts1 | m1 | m2 | ... | mi |
+-----+------+------+-----+------+
| t_1 | m1_1 | m2_1 | ... | mi_1 |
| ... | ... | ... | ... | ... |
| t_k | m1_k | m2_k | ... | mi_k |
+-----+------+------+-----+------+
Table 2
+-----+------+------+-----+------+
| ts2 | s1 | s2 | ... | sn |
+-----+------+------+-----+------+
| s_1 | s1_1 | s2_1 | ... | sn_1 |
| ... | ... | ... | ... | ... |
| s_p | s1_p | s2_p | ... | sn_p |
+-----+------+------+-----+------+
The timestamps ts1 and ts2 are most likely different, but they interleave in time.
I need to build a result table of the following form:
Result Table
+-----+------+------+-----+------+------+------+-----+------+
| ts1 | m1 | m2 | ... | mi | s1 | s2 | ... | si |
+-----+------+------+-----+------+------+------+-----+------+
| t_1 | m1_1 | m2_1 | ... | mi_1 | z1_1 | z2_1 | ... | zi_1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| t_k | m1_k | m2_k | ... | mi_k | z1_k | z2_k | ... | zi_k |
+-----+------+------+-----+------+------+------+-----+------+
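Each value z in that table should be the last valid entry of the corresponding s column (last in terms of time, hence the timestamps), taken from the rows before the current timestamp. (I hope that is understandable.)
My solution is: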
# Combining data
ResultTable=pandas.concat([Table1, Table2]).sort_index()
# retrieving last valid entries for s
ResultTable.s1.fillna(method='pad', inplace=True)
ResultTable.s2.fillna(method='pad', inplace=True)
...
ResultTable.si.fillna(method='pad', inplace=True)
# removing the unneeded ts2 timestamps from the result
# many ideas how to do that (e.g. deleting rows with NaN in the m columns)
# please tell me: what would be most efficient?
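Regarding efficiency, some details on size: in my simple example, Table 1 has 4,000,000 rows and 8 columns (the column count may grow to 50). Table 2 contains about 1,000,000 rows and 85 columns.
WOW - with his merge_asof hint, jezrael solved this in what reads as a single line of code: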
test2 = pandas.merge_asof(Table1.sort_index(), Table2.sort_index(),
                          left_index=True, right_index=True)
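One detail worth checking before relying on the one-liner: by default, merge_asof treats a Table2 row with exactly the same timestamp as a match, whereas the question asks for rows before the current timestamp. The sketch below shows the difference on made-up toy frames (`left`, `right` and their values are assumptions for illustration, not the real tables), using the documented `allow_exact_matches` parameter:

```python
import pandas as pd

# Hypothetical toy data, not the asker's real tables.
left = pd.DataFrame(
    {"m1": [10, 20, 30]},
    index=pd.to_datetime(["2017-03-26", "2017-03-28", "2017-04-02"]),
)
right = pd.DataFrame(
    {"s1": [1.0, 2.0]},
    index=pd.to_datetime(["2017-03-28", "2017-03-30"]),
)

# Default behaviour: a right-hand row with the same timestamp counts as a match.
inclusive = pd.merge_asof(left, right, left_index=True, right_index=True)

# Only strictly earlier right-hand rows are considered.
strict = pd.merge_asof(
    left, right, left_index=True, right_index=True, allow_exact_matches=False
)

print(inclusive)
print(strict)
```

If the two tables never share a timestamp, both calls should give the same result, so this only matters when ties are possible.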
Answer 0 (score: 3)
The rest of your code can be simplified:
#if ts2 is column
cols2 = Table2.columns.difference(['ts2'])
#if ts2 is index
#cols2 = Table2.columns
ResultTable[cols2] = ResultTable[cols2].ffill()
instead of:
ResultTable.s1.fillna(method='pad',inplace=True)
ResultTable.s2.fillna(method='pad',inplace=True)
...
ResultTable.si.fillna(method='pad',inplace=True)
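As a small illustration of the column-subset forward fill, here is a sketch on a made-up frame (`combined` and `s_cols` are hypothetical names standing in for ResultTable and cols2). As far as I know, recent pandas releases also deprecate the `method` argument of `fillna`, which is one more reason to prefer `ffill()`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the concatenated, time-sorted frame.
combined = pd.DataFrame(
    {
        "m1": [1.0, np.nan, 2.0, np.nan, 3.0],
        "s1": [np.nan, 7.0, np.nan, 8.0, np.nan],
        "s2": [np.nan, 5.0, np.nan, 6.0, np.nan],
    },
    index=pd.date_range("2017-03-26", periods=5),
)

s_cols = ["s1", "s2"]  # in the answer above this would be Table2.columns

# Forward-fill all s columns in one call; the m columns are left untouched.
combined[s_cols] = combined[s_cols].ffill()
print(combined)
```

Rows that come before the first s entry stay NaN, since there is nothing earlier to carry forward.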
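If you want to remove the rows with NaN in the m columns, use notnull to flag the non-NaN values, check with all whether every m value in a row is non-NaN, and filter by boolean indexing:
#if ts1 is column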
cols1 = Table1.columns.difference(['ts1'])
#if ts1 is index
#cols1 = Table1.columns
m = ResultTable[cols1].notnull().all(axis=1)
ResultTable = ResultTable[m]
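The same filter can probably be written with `dropna(subset=...)`, which drops every row that has a NaN in any of the listed columns. A minimal sketch on made-up data (`combined` and `m_cols` are hypothetical names) comparing both spellings:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the middle row came from Table2, so its m columns are NaN.
combined = pd.DataFrame(
    {
        "m1": [1.0, np.nan, 2.0],
        "m2": [4.0, np.nan, 5.0],
        "s1": [np.nan, 7.0, 7.0],
    },
    index=pd.date_range("2017-03-26", periods=3),
)
m_cols = ["m1", "m2"]

# Boolean mask: keep rows where every m column holds a value.
mask = combined[m_cols].notnull().all(axis=1)
by_mask = combined[mask]

# dropna with a column subset removes rows with any NaN in those columns.
by_dropna = combined.dropna(subset=m_cols)

print(by_mask.equals(by_dropna))
```

Note that both variants also drop genuine Table1 rows that happen to contain a NaN in an m column, so they assume Table1 itself has no missing values.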
Sample:
import numpy as np
import pandas as pd

np.random.seed(45)
rng = (pd.date_range('2017-03-26', periods=3).tolist() +
       pd.date_range('2017-04-01', periods=2).tolist() +
       pd.date_range('2017-04-08', periods=3).tolist() +
       pd.date_range('2017-04-13', periods=2).tolist())
Table1 = pd.DataFrame(np.random.randint(10, size=(10, 10)), index=rng).add_prefix('m')
Table1.index.name = 'ts1'
print (Table1)
            m0  m1  m2  m3  m4  m5  m6  m7  m8  m9
ts1
2017-03-26   3   0   5   3   4   9   8   1   5   9
2017-03-27   6   8   7   8   5   2   8   1   6   4
2017-03-28   8   4   6   4   9   1   6   8   8   1
2017-04-01   6   0   4   9   8   0   9   2   6   7
2017-04-02   0   0   2   9   2   6   0   9   6   0
2017-04-08   8   8   0   6   7   8   5   1   3   7
2017-04-09   5   9   3   2   7   7   4   9   9   9
2017-04-10   9   7   2   7   9   4   5   7   9   7
2017-04-13   6   2   7   7   6   6   3   6   0   7
2017-04-14   4   9   3   5   7   3   5   5   7   1
rng = (pd.date_range('2017-03-27', periods=3).tolist() +
       pd.date_range('2017-04-03', periods=2).tolist() +
       pd.date_range('2017-04-06', periods=3).tolist() +
       pd.date_range('2017-04-10', periods=2).tolist())
Table2 = pd.DataFrame(np.random.randint(10, size=(10, 10)), index=rng).add_prefix('s')
Table2.index.name = 'ts2'
print (Table2)
            s0  s1  s2  s3  s4  s5  s6  s7  s8  s9
ts2
2017-03-27   0   2   1   9   2   3   9   6   3   6
2017-03-28   1   9   1   7   4   0   2   1   1   4
2017-03-29   2   2   2   5   3   6   7   5   6   5
2017-04-03   2   8   7   1   2   7   9   6   4   5
2017-04-04   4   5   4   1   3   7   0   5   0   6
2017-04-06   5   8   0   1   9   9   2   4   4   0
2017-04-07   8   2   8   9   7   5   4   3   2   5
2017-04-08   7   9   2   5   8   0   8   9   4   0
2017-04-10   2   5   1   2   1   4   2   3   7   0
2017-04-11   2   0   8   8   6   8   7   5   2   9
ResultTable=pd.concat([Table1, Table2]).sort_index()
cols2 = Table2.columns
ResultTable[cols2] = ResultTable[cols2].ffill()
cols1 = Table1.columns
m = ResultTable[cols1].notnull().all(axis=1)
ResultTable = ResultTable[m]
print (ResultTable)
             m0   m1   m2   m3   m4   m5   m6   m7   m8   m9   s0   s1   s2  \
2017-03-26  3.0  0.0  5.0  3.0  4.0  9.0  8.0  1.0  5.0  9.0  NaN  NaN  NaN
2017-03-27  6.0  8.0  7.0  8.0  5.0  2.0  8.0  1.0  6.0  4.0  NaN  NaN  NaN
2017-03-28  8.0  4.0  6.0  4.0  9.0  1.0  6.0  8.0  8.0  1.0  0.0  2.0  1.0
2017-04-01  6.0  0.0  4.0  9.0  8.0  0.0  9.0  2.0  6.0  7.0  2.0  2.0  2.0
2017-04-02  0.0  0.0  2.0  9.0  2.0  6.0  0.0  9.0  6.0  0.0  2.0  2.0  2.0
2017-04-08  8.0  8.0  0.0  6.0  7.0  8.0  5.0  1.0  3.0  7.0  8.0  2.0  8.0
2017-04-09  5.0  9.0  3.0  2.0  7.0  7.0  4.0  9.0  9.0  9.0  7.0  9.0  2.0
2017-04-10  9.0  7.0  2.0  7.0  9.0  4.0  5.0  7.0  9.0  7.0  7.0  9.0  2.0
2017-04-13  6.0  2.0  7.0  7.0  6.0  6.0  3.0  6.0  0.0  7.0  2.0  0.0  8.0
2017-04-14  4.0  9.0  3.0  5.0  7.0  3.0  5.0  5.0  7.0  1.0  2.0  0.0  8.0

             s3   s4   s5   s6   s7   s8   s9
2017-03-26  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2017-03-27  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2017-03-28  9.0  2.0  3.0  9.0  6.0  3.0  6.0
2017-04-01  5.0  3.0  6.0  7.0  5.0  6.0  5.0
2017-04-02  5.0  3.0  6.0  7.0  5.0  6.0  5.0
2017-04-08  9.0  7.0  5.0  4.0  3.0  2.0  5.0
2017-04-09  5.0  8.0  0.0  8.0  9.0  4.0  0.0
2017-04-10  5.0  8.0  0.0  8.0  9.0  4.0  0.0
2017-04-13  8.0  6.0  8.0  7.0  5.0  2.0  9.0
2017-04-14  8.0  6.0  8.0  7.0  5.0  2.0  9.0
Another solution should be boolean indexing.
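To cross-check the concat/ffill approach against merge_asof, the sketch below wraps the steps from the sample in a helper and compares the two results on toy data. The helper name `asof_by_concat`, the seed and the frame shapes are assumptions for illustration. It also assumes that a stable sort is wanted, so that Table1 rows stay ahead of Table2 rows on equal timestamps, and that `allow_exact_matches=False` is the matching merge_asof setting:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, toy data only
left = pd.DataFrame(
    rng.integers(10, size=(10, 3)),
    index=pd.date_range("2017-03-26", periods=10, freq="2D"),
).add_prefix("m")
right = pd.DataFrame(
    rng.integers(10, size=(10, 2)),
    index=pd.date_range("2017-03-27", periods=10, freq="3D"),
).add_prefix("s")


def asof_by_concat(left, right):
    """Concat, forward-fill the right-hand columns, keep only the left rows."""
    # A stable sort keeps left rows ahead of right rows on equal timestamps.
    combined = pd.concat([left, right]).sort_index(kind="mergesort")
    combined[right.columns] = combined[right.columns].ffill()
    # Rows that originate from `left` have values in all left-hand columns.
    return combined[combined[left.columns].notnull().all(axis=1)]


via_concat = asof_by_concat(left, right)
via_asof = pd.merge_asof(left, right, left_index=True, right_index=True,
                         allow_exact_matches=False)

# Compare index and values; dtypes differ (the concat route yields floats).
same_index = via_concat.index.equals(via_asof.index)
same_values = np.allclose(via_concat.to_numpy(dtype=float),
                          via_asof.to_numpy(dtype=float), equal_nan=True)
print(same_index, same_values)
```

The merge_asof route avoids building the large intermediate frame, which is likely to matter at the row counts mentioned in the question, though that would need to be measured on the real data.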