Question

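I have two pandas DataFrames full of Timestamps. I want to cross-match these events against each other to within 5 days. If I cross-match df1 and df2, I would like, for example, a list (in the general sense) of size len(df1), where each element contains a list of the indices of the elements in df1 that fall within the specified time limit of the corresponding element in df2. I would also like a similar structure that holds the number of days between the events instead of the indices.

For example: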

import pandas as pd
df1 = pd.DataFrame({'date_1': pd.to_datetime(['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29'])})
df2 = pd.DataFrame({'date_2': pd.to_datetime(['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01'])})

Output:

matched_indices = [[0,1], [0], [3], []]
matched_deltas  = [[0,1], [5], [2], []]

Any ideas?

Answer 1

One solution is to iterate over all rows of df2 and, for each one, compute the difference in days from every date in df1.

import pandas as pd

# make sure the df1 dates are datetime64 (a no-op if they already are)
dates_1 = pd.to_datetime(df1['date_1'])

matched_indices = []
matched_deltas = []
# iterate through the rows of df2
for index, row in df2.iterrows():
    # s is a Series holding the absolute difference in days between the two dates; its index is the same as df1's
    s = abs((dates_1 - pd.to_datetime(row['date_2'])).dt.days)
    # keep only the differences of at most 5 days
    s = s[s <= 5]
    # add the matching df1 indices to matched_indices
    matched_indices.append(s.index.tolist())
    # add the corresponding day counts to matched_deltas
    matched_deltas.append(s.tolist())
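
The loop above compares every df2 date with every df1 date. For larger frames, one possible alternative is to sort the df1 dates once and use np.searchsorted to find the window around each df2 date. The following is only a sketch under some assumptions: the function name cross_match_sorted and the max_days parameter are made up for this illustration, the window is inclusive and measured on exact time differences, and day counts are floored to whole days.

```python
import numpy as np
import pandas as pd

def cross_match_sorted(df1, df2, max_days=5):
    """For each row of df2, return the df1 index labels within max_days and the day gaps."""
    dates_1 = pd.to_datetime(df1['date_1']).to_numpy()
    dates_2 = pd.to_datetime(df2['date_2']).to_numpy()

    # sort the df1 dates once; 'order' maps sorted positions back to original positions
    order = np.argsort(dates_1, kind='stable')
    sorted_1 = dates_1[order]

    # for every df2 date, locate the slice of sorted df1 dates inside [date - window, date + window]
    window = np.timedelta64(max_days, 'D')
    lo = np.searchsorted(sorted_1, dates_2 - window, side='left')
    hi = np.searchsorted(sorted_1, dates_2 + window, side='right')

    labels = df1.index.to_numpy()
    matched_indices, matched_deltas = [], []
    for date_2, start, stop in zip(dates_2, lo, hi):
        # positions of the matches, put back into df1's original order
        pos = np.sort(order[start:stop])
        matched_indices.append(labels[pos].tolist())
        # whole days between the events
        deltas = np.abs(dates_1[pos] - date_2) // np.timedelta64(1, 'D')
        matched_deltas.append(deltas.astype(int).tolist())
    return matched_indices, matched_deltas

df1 = pd.DataFrame({'date_1': ['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29']})
df2 = pd.DataFrame({'date_2': ['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01']})
matched_indices, matched_deltas = cross_match_sorted(df1, df2)
```

On the example data this should give the same lists as requested in the question, with an empty list for the df2 date that has no match.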

Hope this helps!

Answer 2

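import numpy as np
import pandas as pd

# s[i, j] is the absolute gap in days between df1 row i and df2 row j
# (pd.to_datetime is a no-op if the columns are already datetime64)
s = np.abs(pd.to_datetime(df1.date_1).values[:, None] - pd.to_datetime(df2.date_2).values) / np.timedelta64(1, 'D')
newdf = pd.DataFrame(s)
# blank out gaps above 5 days, then keep one entry per remaining (df1 row, df2 row) pair
close = newdf.mask(newdf > 5).stack().dropna()

matched_deltas = close.groupby(level=1).apply(list).reindex(range(len(df2))).tolist()
print(matched_deltas)

matched_indices = close.reset_index().groupby('level_1')['level_0'].apply(list).reindex(range(len(df2))).tolist()
print(matched_indices)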

Output:

[[0.0, 1.0], [5.0], [2.0], nan]
[[0, 1], [0], [3], nan]
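
Note that the output above has nan where the question asked for an empty list. A variation on the same broadcasting idea that yields empty lists could look like the sketch below. It assumes the 5-day limit is inclusive and that truncating the gaps to whole days is acceptable; the variable names days and within are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date_1': ['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29']})
df2 = pd.DataFrame({'date_2': ['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01']})

d1 = pd.to_datetime(df1['date_1']).to_numpy()
d2 = pd.to_datetime(df2['date_2']).to_numpy()

# days[i, j] is the absolute gap in days between df2 row i and df1 row j
days = np.abs(d2[:, None] - d1[None, :]) / np.timedelta64(1, 'D')
# boolean mask of the pairs that lie within the 5-day limit
within = days <= 5

# one list per df2 row; rows without any match simply produce an empty list
matched_indices = [df1.index[row].tolist() for row in within]
# astype(int) truncates any fractional part of a day
matched_deltas = [gap[row].astype(int).tolist() for gap, row in zip(days, within)]
```

Like the original, this builds the full len(df2) x len(df1) matrix, so memory use grows with the product of the two lengths.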

Pandas cross-matching in time
