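I have two pandas DataFrames full of Timestamps. I want to cross-match these events to within 5 days of each other. If I cross-match df1 and df2, I would like, for example, a list (in the general sense) of size len(df1), where each element contains a list of the indices of the elements in df1 that fall within the specified time limit of the corresponding element in df2. I would also like a similar structure that, instead of indices, contains the number of days between the events.
For example: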
df1 = pd.DataFrame({'date_1': ['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29']})
df2 = pd.DataFrame({'date_2': ['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01']})
Desired output:
matched_indices = [[0,1], [0], [3], []]
matched_deltas = [[0,1], [5], [2], []]
Any ideas?
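One thing to note about the sample data: the columns are built from strings, so date arithmetic will not work until they are converted. A minimal sketch, assuming `pd.to_datetime` is an acceptable way to parse them (the variable `gap_days` is only there for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'date_1': ['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29']})
df2 = pd.DataFrame({'date_2': ['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01']})

# The columns start out as plain strings (object dtype); convert them so
# that subtracting two dates yields a Timedelta.
df1['date_1'] = pd.to_datetime(df1['date_1'])
df2['date_2'] = pd.to_datetime(df2['date_2'])

# Whole days between the first df1 event and the second df2 event
gap_days = abs((df1['date_1'].iloc[0] - df2['date_2'].iloc[1]).days)
print(gap_days)  # 5
```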
Answer 0 (score: 0)
One solution is to iterate over all the rows of df2 and, for each one, compute the difference in days from every date in df1.
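import pandas as pd

# Assumption: the sample dates are strings, so convert them to datetimes first
df1['date_1'] = pd.to_datetime(df1['date_1'])
df2['date_2'] = pd.to_datetime(df2['date_2'])

matched_indices = []
matched_deltas = []
# iterate through the rows of df2
for index, row in df2.iterrows():
    # s is a Series holding the absolute difference in days between each df1 date
    # and this df2 date; its index is the same as df1's
    s = abs((df1['date_1'] - row['date_2']).dt.days)
    # keep only the differences that are at most 5 days
    s = s.where(s <= 5).dropna()
    # add the matching df1 indices to matched_indices
    matched_indices.append(list(s.index.values))
    # add the day differences to matched_deltas
    matched_deltas.append(list(s.values.astype(int)))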
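For anyone who wants to check this against the question's data, here is the same loop idea wrapped in a small function and run end to end. This is only a sketch: the function name `cross_match` and the `max_days` parameter are made up for illustration, and it assumes the dates can be parsed with `pd.to_datetime`. It takes the absolute value before extracting `.days`, which should also behave sensibly when the timestamps carry a time of day.

```python
import pandas as pd

def cross_match(df1, df2, max_days=5):
    """For each row of df2, list the df1 index labels within max_days and the day gaps.

    Hypothetical helper; expects datetime64 columns 'date_1' and 'date_2'.
    """
    matched_indices, matched_deltas = [], []
    for date_2 in df2['date_2']:
        # absolute gap in whole days between every df1 date and this df2 date
        gaps = (df1['date_1'] - date_2).abs().dt.days
        close = gaps[gaps <= max_days]
        matched_indices.append(close.index.tolist())
        matched_deltas.append(close.tolist())
    return matched_indices, matched_deltas

df1 = pd.DataFrame({'date_1': pd.to_datetime(
    ['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29'])})
df2 = pd.DataFrame({'date_2': pd.to_datetime(
    ['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01'])})

matched_indices, matched_deltas = cross_match(df1, df2)
print(matched_indices)  # [[0, 1], [0], [3], []]
print(matched_deltas)   # [[0, 1], [5], [2], []]
```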
Hope this helps!
Answer 1 (score: 0)
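import numpy as np
import pandas as pd

# assumes date_1 and date_2 are already datetime64 columns (e.g. via pd.to_datetime)
# s[i, j] is the absolute difference in days between df1 row i and df2 row j
s = np.abs(df1.date_1.values[:,None]-df2.date_2.values)/np.timedelta64(60*60*24, 's')
newdf=pd.DataFrame(s)
# level 1 of the stacked index is the df2 position, so reindex on df2's index
matched_deltas = newdf.mask(newdf>5).stack().groupby(level=1).apply(list).reindex(df2.index).tolist()
matched_deltas
matched_indices = newdf.mask(newdf>5).stack().reset_index().groupby('level_1')['level_0'].apply(list).reindex(df2.index).tolist()
matched_indices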
Output:
[[0.0, 1.0], [5.0], [2.0], nan]
[[0, 1], [0], [3], nan]
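Note that this gives nan rather than an empty list for the df2 row with no match. If empty lists are wanted, as in the question, one possible variation is to stay in NumPy and build the lists from a boolean mask. This is a sketch only, not a drop-in replacement: it returns positional indices into df1 (not index labels), and the names `days` and `within` are just illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date_1': pd.to_datetime(
    ['2016-10-10', '2016-10-11', '2016-10-18', '2016-10-29'])})
df2 = pd.DataFrame({'date_2': pd.to_datetime(
    ['2016-10-10', '2016-10-05', '2016-10-27', '2016-10-01'])})

d1 = df1['date_1'].to_numpy()
d2 = df2['date_2'].to_numpy()

# days[i, j] = absolute gap in days between df2 row i and df1 row j (broadcasting)
days = np.abs(d2[:, None] - d1) / np.timedelta64(1, 'D')
within = days <= 5

# one list per df2 row; rows with no match become empty lists
matched_indices = [np.flatnonzero(row).tolist() for row in within]
matched_deltas = [d[m].tolist() for d, m in zip(days, within)]

print(matched_indices)  # [[0, 1], [0], [3], []]
print(matched_deltas)   # [[0.0, 1.0], [5.0], [2.0], []]
```

The full len(df2) x len(df1) matrix is held in memory, so this is probably only reasonable for moderately sized frames.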