Better or more efficient solution for combining DataFrames, fillna and thinning out

Time: 2017-09-29 05:17:32

Tags: python python-3.x pandas dataframe

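I already have a solution that produces the desired result, but it seems far from optimal to me.
Now to describe the situation:
Given are two different pandas DataFrames, each with timestamps as its index (taken from synchronized clocks). The tables below visualize them and introduce the names used further on.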

Table 1
+-----+------+------+-----+------+
| ts1 |  m1  |  m2  | ... |  mi  |
+-----+------+------+-----+------+
| t_1 | m1_1 | m2_1 | ... | mi_1 |
| ... | ...  | ...  | ... | ...  |
| t_k | m1_k | m2_k | ... | mi_k |
+-----+------+------+-----+------+

Table 2
+-----+------+------+-----+------+
| ts2 |  s1  |  s2  | ... |  sn  |
+-----+------+------+-----+------+
| s_1 | s1_1 | s2_1 | ... | sn_1 |
| ... | ...  | ...  | ... | ...  |
| s_p | s1_p | s2_p | ... | sn_p |
+-----+------+------+-----+------+

The timestamps ts1 and ts2 most likely differ, but they are interleaved in time.

I need to build a result table of the following form:

Result Table
+-----+------+------+-----+------+------+------+-----+------+
| ts1 |  m1  |  m2  | ... |  mi  |  s1  |  s2  | ... |  sn  |
+-----+------+------+-----+------+------+------+-----+------+
| t_1 | m1_1 | m2_1 | ... | mi_1 | z1_1 | z2_1 | ... | zn_1 |
| ... | ...  | ...  | ... | ...  | ...  | ...  | ... | ...  |
| t_k | m1_k | m2_k | ... | mi_k | z1_k | z2_k | ... | zn_k |
+-----+------+------+-----+------+------+------+-----+------+

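Each value z in the table should be the last valid entry of s, where "last" is meant in terms of time (hence the timestamps): the value from the row of s whose timestamp is equal to or before the timestamp of the current row. (I hope that is understandable.)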

My solution is:

# Combining data
ResultTable=pandas.concat([Table1, Table2]).sort_index()
# retrieving last valid entries for s
ResultTable.s1.fillna(method='pad', inplace=True)
ResultTable.s2.fillna(method='pad', inplace=True)
...
ResultTable.si.fillna(method='pad', inplace=True)
# removing unneeded timestamps `s_1 ... s_k` in result
# many ideas howto do that (deleting rows with NaN in m columns for example)
# please tell me, what would be most efficient?

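Since the question is about efficiency, some details on size: in my simple example Table 1 has 4,000,000 rows and 8 columns (this may grow to 50). Table 2 has about 1,000,000 rows and 85 columns.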

WOW - jezrael's hint about merge_asof solves this in a single line of code:

test2=pandas.merge_asof(Table1.sort_index(), Table2.sort_index(),
                        left_index=True, right_index=True)
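
To make the matching rule concrete, here is a minimal sketch on tiny made-up frames (the names table1, table2 and all values are assumptions for illustration, not data from the question). With its defaults, merge_asof looks backward and accepts an exact timestamp match, which corresponds to the "equal to or before" rule described above. Passing allow_exact_matches=False switches to "strictly before".

```python
import pandas as pd

# Hypothetical miniature versions of Table 1 and Table 2.
table1 = pd.DataFrame(
    {"m1": [10, 11, 12]},
    index=pd.to_datetime(["2017-03-26", "2017-03-28", "2017-04-01"]),
)
table2 = pd.DataFrame(
    {"s1": [1.0, 2.0]},
    index=pd.to_datetime(["2017-03-27", "2017-03-28"]),
)

# Default: for every row of table1 take the last table2 row whose
# timestamp is equal to or before it (direction='backward').
result = pd.merge_asof(
    table1.sort_index(), table2.sort_index(),
    left_index=True, right_index=True,
)

# Variant: ignore table2 rows that have exactly the same timestamp.
strict = pd.merge_asof(
    table1.sort_index(), table2.sort_index(),
    left_index=True, right_index=True,
    allow_exact_matches=False,
)

print(result)
print(strict)
```

The result keeps exactly the rows and the index of table1. The first row has no earlier s entry, so its s1 stays NaN.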

1 Answer:

Answer 0 (score: 3)

The fill step of your code can be simplified to:

#if ts2 is column
cols2 = Table2.columns.difference(['ts2'])
#if ts2 is index
#cols2 = Table2.columns
ResultTable[cols2] = ResultTable[cols2].ffill()

instead of:

ResultTable.s1.fillna(method='pad',inplace=True)
ResultTable.s2.fillna(method='pad',inplace=True)
...
ResultTable.si.fillna(method='pad',inplace=True)
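
A small sketch of the column-subset forward fill on made-up data (the frame and the column list s_cols are assumptions for illustration). It fills all s columns in one call and leaves the m columns untouched. As far as I know, newer pandas releases also deprecate the method= argument of fillna in favour of ffill, which is one more reason to prefer this form.

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame: one m column and two s columns with gaps.
frame = pd.DataFrame({
    "m1": [1.0, np.nan, 3.0],
    "s1": [5.0, np.nan, np.nan],
    "s2": [np.nan, 6.0, np.nan],
})

s_cols = ["s1", "s2"]

# Forward-fill only the s columns; m1 keeps its NaN.
frame[s_cols] = frame[s_cols].ffill()

print(frame)
```

A leading NaN (here the first value of s2) stays NaN, because there is no earlier value to carry forward.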

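If you want to remove the rows with NaN in the m columns, use notnull to flag the non-NaN values, check with all that every m value in a row is non-NaN, and filter by boolean indexing: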

#if ts1 is column
cols1 = Table1.columns.difference(['ts1'])
#if ts1 is index
#cols1 = Table1.columns

m = ResultTable[cols1].notnull().all(axis=1)

ResultTable = ResultTable[m]
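
As a hedged alternative to the mask above: dropna with the subset argument should select the same rows, since its default how='any' drops a row as soon as one of the listed columns is NaN. The sketch below checks this on made-up data (the frame combined and the list m_cols are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame after concat + sort + ffill:
# rows that came from Table 2 have NaN in every m column.
combined = pd.DataFrame(
    {
        "m1": [1.0, np.nan, 3.0, np.nan],
        "m2": [4.0, np.nan, 6.0, np.nan],
        "s1": [np.nan, 7.0, 7.0, 8.0],
    },
    index=pd.to_datetime(
        ["2017-03-26", "2017-03-27", "2017-03-28", "2017-03-29"]
    ),
)

m_cols = ["m1", "m2"]

# Boolean mask: keep rows where all m columns are non-NaN.
by_mask = combined[combined[m_cols].notnull().all(axis=1)]

# Same filter via dropna restricted to the m columns.
by_dropna = combined.dropna(subset=m_cols)

print(by_mask.equals(by_dropna))
```

Note that NaN values in the s columns do not cause a row to be dropped in either variant, because only the m columns are inspected.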

Sample:

import numpy as np
import pandas as pd

np.random.seed(45)


rng = (pd.date_range('2017-03-26', periods=3).tolist() + 
      pd.date_range('2017-04-01', periods=2).tolist() +
      pd.date_range('2017-04-08', periods=3).tolist() + 
      pd.date_range('2017-04-13', periods=2).tolist())
Table1 = pd.DataFrame(np.random.randint(10, size=(10, 10)), index=rng).add_prefix('m') 
Table1.index.name = 'ts1'
print (Table1)
            m0  m1  m2  m3  m4  m5  m6  m7  m8  m9
ts1                                               
2017-03-26   3   0   5   3   4   9   8   1   5   9
2017-03-27   6   8   7   8   5   2   8   1   6   4
2017-03-28   8   4   6   4   9   1   6   8   8   1
2017-04-01   6   0   4   9   8   0   9   2   6   7
2017-04-02   0   0   2   9   2   6   0   9   6   0
2017-04-08   8   8   0   6   7   8   5   1   3   7
2017-04-09   5   9   3   2   7   7   4   9   9   9
2017-04-10   9   7   2   7   9   4   5   7   9   7
2017-04-13   6   2   7   7   6   6   3   6   0   7
2017-04-14   4   9   3   5   7   3   5   5   7   1
rng = (pd.date_range('2017-03-27', periods=3).tolist() + 
      pd.date_range('2017-04-03', periods=2).tolist() +
      pd.date_range('2017-04-06', periods=3).tolist() + 
      pd.date_range('2017-04-10', periods=2).tolist())
Table2 = pd.DataFrame(np.random.randint(10, size=(10, 10)), index=rng).add_prefix('s') 
Table2.index.name = 'ts2' 
print (Table2)
            s0  s1  s2  s3  s4  s5  s6  s7  s8  s9
ts2                                               
2017-03-27   0   2   1   9   2   3   9   6   3   6
2017-03-28   1   9   1   7   4   0   2   1   1   4
2017-03-29   2   2   2   5   3   6   7   5   6   5
2017-04-03   2   8   7   1   2   7   9   6   4   5
2017-04-04   4   5   4   1   3   7   0   5   0   6
2017-04-06   5   8   0   1   9   9   2   4   4   0
2017-04-07   8   2   8   9   7   5   4   3   2   5
2017-04-08   7   9   2   5   8   0   8   9   4   0
2017-04-10   2   5   1   2   1   4   2   3   7   0
2017-04-11   2   0   8   8   6   8   7   5   2   9
ResultTable=pd.concat([Table1, Table2]).sort_index()

cols2 = Table2.columns
ResultTable[cols2] = ResultTable[cols2].ffill()

cols1 = Table1.columns
m = ResultTable[cols1].notnull().all(axis=1)

ResultTable = ResultTable[m]
print (ResultTable)
             m0   m1   m2   m3   m4   m5   m6   m7   m8   m9   s0   s1   s2  \
2017-03-26  3.0  0.0  5.0  3.0  4.0  9.0  8.0  1.0  5.0  9.0  NaN  NaN  NaN   
2017-03-27  6.0  8.0  7.0  8.0  5.0  2.0  8.0  1.0  6.0  4.0  NaN  NaN  NaN   
2017-03-28  8.0  4.0  6.0  4.0  9.0  1.0  6.0  8.0  8.0  1.0  0.0  2.0  1.0   
2017-04-01  6.0  0.0  4.0  9.0  8.0  0.0  9.0  2.0  6.0  7.0  2.0  2.0  2.0   
2017-04-02  0.0  0.0  2.0  9.0  2.0  6.0  0.0  9.0  6.0  0.0  2.0  2.0  2.0   
2017-04-08  8.0  8.0  0.0  6.0  7.0  8.0  5.0  1.0  3.0  7.0  8.0  2.0  8.0   
2017-04-09  5.0  9.0  3.0  2.0  7.0  7.0  4.0  9.0  9.0  9.0  7.0  9.0  2.0   
2017-04-10  9.0  7.0  2.0  7.0  9.0  4.0  5.0  7.0  9.0  7.0  7.0  9.0  2.0   
2017-04-13  6.0  2.0  7.0  7.0  6.0  6.0  3.0  6.0  0.0  7.0  2.0  0.0  8.0   
2017-04-14  4.0  9.0  3.0  5.0  7.0  3.0  5.0  5.0  7.0  1.0  2.0  0.0  8.0   

             s3   s4   s5   s6   s7   s8   s9  
2017-03-26  NaN  NaN  NaN  NaN  NaN  NaN  NaN  
2017-03-27  NaN  NaN  NaN  NaN  NaN  NaN  NaN  
2017-03-28  9.0  2.0  3.0  9.0  6.0  3.0  6.0  
2017-04-01  5.0  3.0  6.0  7.0  5.0  6.0  5.0  
2017-04-02  5.0  3.0  6.0  7.0  5.0  6.0  5.0  
2017-04-08  9.0  7.0  5.0  4.0  3.0  2.0  5.0  
2017-04-09  5.0  8.0  0.0  8.0  9.0  4.0  0.0  
2017-04-10  5.0  8.0  0.0  8.0  9.0  4.0  0.0  
2017-04-13  8.0  6.0  8.0  7.0  5.0  2.0  9.0  
2017-04-14  8.0  6.0  8.0  7.0  5.0  2.0  9.0  
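
One thing to watch in the output above: the 2017-03-27 row has NaN in the s columns although Table2 has an entry for exactly that timestamp, and 2017-03-28 carries the values of 2017-03-27. Apparently, at equal timestamps the Table1 row ends up before the Table2 row after sorting, so the forward fill behaves like "strictly before". If "equal to or before" is wanted with the concat approach, one possible sketch (an assumption on my side, with made-up frames t1 and t2) is to concatenate Table2 first and sort with a stable algorithm, so the s row precedes the m row on ties:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames sharing the timestamp 2017-03-28.
t1 = pd.DataFrame(
    {"m1": [10, 11, 12]},
    index=pd.to_datetime(["2017-03-26", "2017-03-28", "2017-04-01"]),
)
t2 = pd.DataFrame(
    {"s1": [1.0, 2.0]},
    index=pd.to_datetime(["2017-03-27", "2017-03-28"]),
)

# s rows first + stable sort: on equal timestamps the s row stays
# ahead of the m row, so its value is carried into the m row.
combined = pd.concat([t2, t1]).sort_index(kind="mergesort")
combined[t2.columns] = combined[t2.columns].ffill()
via_concat = combined[combined[t1.columns].notnull().all(axis=1)]

# Reference: merge_asof with its default exact-match behaviour.
via_asof = pd.merge_asof(t1, t2, left_index=True, right_index=True)

print(via_concat)
print(via_asof)
```

On this toy input both variants give the same s1 column. The m columns differ only in dtype, because concat introduces NaN and therefore floats.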

Another solution would be boolean indexing.