Filter a DataFrame based on the index of another DataFrame (Python pandas)

Date: 2017-11-03 10:20:23

Tags: python pandas

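I have two dataframes, and I want to create a filtered version of df_1 based on the timestamp index of df_2, as shown below. For each index value of df_2, I want to get all rows of df_1 whose index lies within a timedelta of 1 day of that df_2 index value.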

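Example: for the df_2 index value 10/15/2017, I want the new df_outcome to include all df_1 rows between 10/14/2017 and 10/16/2017, which returns 10/14/2017 f and 10/15/2017 g. Any duplicates produced by the queries will be removed.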

Any help is appreciated, thanks.

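Edit:

I have edited the question to change the index to timestamps, to reflect the actual problem. I apologize for any confusion; I did not expect it to make a difference. The timestamps are not evenly spaced.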

2 answers:

Answer 0: (score: 0)

Answer to the updated question

The new answer builds on the original one and likewise uses a set to identify the valid indices:

import pandas as pd

# convert strings to timestamps if not done already
df_1.index = pd.to_datetime(df_1.index)
df_2.index = pd.to_datetime(df_2.index)

# helper function to extract whole days since the Unix epoch
def extract_days_since_epoch(timeseries):
    # pd.Timestamp rather than pd.datetime, which recent pandas no longer provides
    epoch_start = pd.Timestamp(1970, 1, 1)
    return (timeseries - epoch_start).days

# get indices as days
index_1_days = extract_days_since_epoch(df_1.index)
index_2_days = extract_days_since_epoch(df_2.index)

# allow every day number within `threshold` days of a df_2 index value
threshold = 1
ranges = [range(x-threshold, x+threshold+1) for x in index_2_days]
allowed_indices = {value for sub_range in ranges
                         for value in sub_range}

# get intersection of allowed and present indices
valid_indices = allowed_indices.intersection(index_1_days)

# use assign, query and drop to filter matches
df_1.assign(days=index_1_days)\
    .query("days in @valid_indices")\
    .drop(["days"], axis=1)

           Values
Index
2017-10-04      b
2017-10-05      c
2017-10-07      d
2017-10-14      f
2017-10-15      g
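
One thing worth checking with the day-number approach is that it compares calendar days, not exact elapsed time. The sketch below is only an illustration under assumed inputs: the two timestamps and the `days_since_epoch` helper are hypothetical and not taken from the question's data. It shows two timestamps that are one calendar day apart while being almost 48 hours apart.

```python
import pandas as pd

EPOCH = pd.Timestamp(1970, 1, 1)

def days_since_epoch(index):
    """Whole days elapsed since the Unix epoch for each timestamp."""
    return (pd.DatetimeIndex(index) - EPOCH).days

# Hypothetical timestamps at opposite ends of two adjacent calendar days
early = pd.Timestamp("2017-10-14 00:05")
late = pd.Timestamp("2017-10-15 23:55")

day_numbers = days_since_epoch([early, late])
day_gap = int(day_numbers[1] - day_numbers[0])  # difference in calendar days
exact_gap = late - early                        # true elapsed time

# The day-number test accepts the pair, the exact timedelta test does not
within_day_buckets = day_gap <= 1
within_exact_day = exact_gap <= pd.Timedelta("1D")
print(day_gap, exact_gap, within_day_buckets, within_exact_day)
```

If "within 1 day" is meant as an exact 24-hour window, a comparison on the timestamps themselves may be the safer choice; if it is meant as "same or neighbouring calendar day", the day-number approach fits.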

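Answer to the original question

You can use the set operations of the pandas Index for this purpose. First, create a set of allowed indices using list and set comprehensions. Second, get the intersection of the allowed and the existing indices. Finally, use the valid indices to reindex the target dataframe: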

# define threshold range to include values from df2
threshold = 10

# create set of allowed indices via set comprehension
ranges = [range(x-threshold, x+threshold+1) for x in df_2.index]
allowed_indices = {value for sub_range in ranges for value in sub_range}

# get intersection of allowed and present indices
valid_indices = df_1.index.intersection(allowed_indices)

# use reindex with valid indices
df_result = df_1.reindex(valid_indices)

print(df_result)

    Values
10  a
20  b
30  c
40  d
70  g
80  h
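
For readers who want to run the three steps end to end, here is a minimal self-contained sketch. The contents of `df_1` and `df_2` are assumptions: the question's original integer-indexed tables are not reproduced here, so the frames below (including the `Other` column) are hypothetical and were chosen only to be consistent with the output shown above.

```python
import pandas as pd

# Hypothetical sample data (not the question's original tables)
df_1 = pd.DataFrame({"Values": list("abcdefgh")}, index=range(10, 90, 10))
df_2 = pd.DataFrame({"Other": [1, 2, 3]}, index=[20, 30, 80])

threshold = 10

# Step 1: every integer within `threshold` of a df_2 index value
allowed_indices = {
    value
    for x in df_2.index
    for value in range(x - threshold, x + threshold + 1)
}

# Step 2: keep only the allowed values that actually occur in df_1's index
# (the set is passed as a sorted list so the input order is deterministic)
valid_indices = df_1.index.intersection(sorted(allowed_indices))

# Step 3: select those rows from df_1
df_result = df_1.reindex(valid_indices)
print(df_result)
```

Because the set removes repeated values, overlapping ranges from neighbouring df_2 entries (here 20 and 30) do not produce duplicate rows.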

Answer 1: (score: 0)

The following indexer should do the job on datetimes:

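# maximum allowed distance to the nearest df2 timestamp
threshold = pd.Timedelta('1 hour')

# for each df1 timestamp, check whether its closest df2 timestamp is within the threshold
indexer = pd.Series(df1.index, index=df1.index).apply(
  lambda x: min(abs(x - df2.index)) < threshold
)

# keep only the df1 rows flagged True
df1.loc[indexer]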

Note: this does not scale well. Once len(df1) * len(df2) reaches roughly 10^6, it starts to take several seconds.
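
One possible way to avoid the pairwise comparison is to let pandas look up the nearest df_2 timestamp for each df_1 timestamp with `Index.get_indexer(..., method="nearest", tolerance=...)`, which returns -1 where nothing lies within the tolerance. This is a sketch under assumptions, not part of either answer: the sample frames, the `Other` column and the `filter_within` helper are hypothetical, and the 1-day tolerance follows the question's wording.

```python
import pandas as pd

def filter_within(df, other_index, tolerance):
    """Keep rows of df whose index lies within `tolerance` of any value in other_index."""
    # get_indexer with method="nearest" needs a unique, sorted reference index
    reference = pd.DatetimeIndex(other_index).unique().sort_values()
    positions = reference.get_indexer(df.index, method="nearest", tolerance=tolerance)
    # -1 marks rows with no reference timestamp inside the tolerance
    return df[positions != -1]

# Hypothetical sample data with unevenly spaced timestamps
df_1 = pd.DataFrame(
    {"Values": list("abcdefg")},
    index=pd.to_datetime([
        "2017-10-01 08:00", "2017-10-04 12:30", "2017-10-05 09:15",
        "2017-10-07 18:45", "2017-10-10 07:00", "2017-10-14 22:10",
        "2017-10-15 06:05",
    ]),
)
df_2 = pd.DataFrame(
    {"Other": [1, 2, 3]},
    index=pd.to_datetime(["2017-10-05 00:00", "2017-10-06 20:00", "2017-10-15 00:00"]),
)

df_outcome = filter_within(df_1, df_2.index, pd.Timedelta("1D"))
print(df_outcome)
```

Each df_1 row is tested once against its nearest reference timestamp, so a row close to several df_2 values still appears only once in the result.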