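Create a conditional column for the date difference based on matching values in two columns

Time: 2019-05-03 06:59:36

Tags: python pandas python-2.7 numpy data-science

I have a DataFrame and I am struggling to create a column based on other columns. I will illustrate the problem with sample data.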

        Date    Target1        Close
0   2019-04-17  209.2440    203.130005
1   2019-04-17  212.2155    203.130005
2   2019-04-17  213.6330    203.130005
3   2019-04-17  213.0555    203.130005
4   2019-04-17  212.6250    203.130005
5   2019-04-17  212.9820    203.130005
6   2019-04-17  213.1395    203.130005
7   2019-04-16  209.2860    199.250000
8   2019-04-16  209.9055    199.250000
9   2019-04-16  210.3045    199.250000
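
For anyone who wants to reproduce this, the sample above can be rebuilt as a DataFrame roughly as follows. This is only a sketch: the variable name df and the conversion of Date to datetime are assumptions, the values are copied from the table.

```python
import pandas as pd

# Values copied from the sample table; the name `df` is an assumption.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-04-17"] * 7 + ["2019-04-16"] * 3),
    "Target1": [209.2440, 212.2155, 213.6330, 213.0555, 212.6250,
                212.9820, 213.1395, 209.2860, 209.9055, 210.3045],
    "Close": [203.130005] * 7 + [199.250000] * 3,
})
```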

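I want to create another column (one value per observation), called for example days_to_hit_target, that holds the number of days until Close matches, or comes very close to, the Target1 of that particular date. When such a match is found, the difference in days should be computed and stored in the days_to_hit_target column.

3 answers:

Answer 0: (score: 2)

Note: I am using Python 3.7.1 and pandas 0.23.4. I came up with something quite dirty; I am sure there is a neater and more efficient way.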

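### Create sample data
import numpy as np
import pandas as pd

# closed="right" matches pandas 0.23.4; pandas 1.4+ uses inclusive="right" instead
date_range = pd.date_range(start="2018-01-01", end="2018-01-20", freq="6H", closed="right")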

target1 = np.random.uniform(10, 30, len(date_range))

close = [[i]*4 for i in np.random.uniform(10,30, len(date_range)//4)]
close_flat = np.array([item for sublist in close for item in sublist])

df = pd.DataFrame(np.array([np.array(date_range.date), target1,
    close_flat]).transpose(), columns=["date", "target", "close"])


### Create the column you need
# iterating over the days and finding days when the difference between
# "close" of current day and all "target" is lower than 0.25 OR the "target"
# value is greater than "close" value.
thresh = 0.25
date_diff_arr = np.zeros(len(df))
for i in range(0,len(df),4):
    diff_lt_thresh = df[(abs(df.target-df.close.iloc[i]) < thresh) | (df.target > df.close.iloc[i])]
    # only keep the findings from the next day onwards
    diff_lt_thresh = diff_lt_thresh.loc[i+4:]
    if not diff_lt_thresh.empty:
        # find day difference only if something under thresh is found
        days_diff = (diff_lt_thresh.iloc[0].date - df.iloc[i].date).days
    else:
        # otherwise write it as nan
        days_diff = np.nan
    # fill in the np.array which will be used to write to the df
    date_diff_arr[i:i+4] = days_diff

df["date_diff"] = date_diff_arr

Sample output:

          date   target    close  date_diff
0   2018-01-01    21.64  26.7319        2.0
1   2018-01-01  22.9047  26.7319        2.0
2   2018-01-01  26.0945  26.7319        2.0
3   2018-01-02  10.2155  26.7319        2.0
4   2018-01-02  17.5602  11.0507        1.0
5   2018-01-02  12.0368  11.0507        1.0
6   2018-01-02  19.5923  11.0507        1.0
7   2018-01-03  21.8168  11.0507        1.0
8   2018-01-03  11.5433  16.8862        1.0
9   2018-01-03  27.3739  16.8862        1.0
10  2018-01-03  26.9073  16.8862        1.0
11  2018-01-04  19.6677  16.8862        1.0
12  2018-01-04  25.3599  27.3373        1.0
13  2018-01-04  22.7479  27.3373        1.0
14  2018-01-04  18.7246  27.3373        1.0
15  2018-01-05  25.4122  27.3373        1.0
16  2018-01-05  28.3294  23.8469        1.0
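
The hit condition used in the loop above can also be checked in isolation. The following is a minimal sketch with made-up numbers (close_today and targets are hypothetical, not taken from the sample data); it only shows how the two boolean masks combine with |.

```python
import pandas as pd

thresh = 0.25
close_today = 20.0                               # hypothetical close of one day
targets = pd.Series([19.5, 19.8, 20.2, 25.0])    # hypothetical target values

near = (targets - close_today).abs() < thresh    # target within thresh of the close
above = targets > close_today                    # target strictly above the close
hit = near | above                               # either condition counts as a hit
```

A row is selected when it satisfies either mask, so 19.8 qualifies through the threshold while 25.0 qualifies only because it is above the close.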

Answer 1: (score: 2)

This should work:

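daysAboveTarget = []
for i in range(len(df.Date)):
    try:
        # first date, from row i onwards, whose Close exceeds row i's target
        dayAboveTarget = df.iloc[i:].loc[(df.Close > df.Target1.iloc[i])]['Date'].iloc[0]
    except IndexError:
        # Close never exceeds this target in the remaining rows
        dayAboveTarget = None
    daysAboveTarget.append(dayAboveTarget)
# assumes df.Date is datetime64 and the rows are sorted by ascending date
daysAboveTarget = pd.to_datetime(pd.Series(daysAboveTarget, index=df.index))
df['days_to_hit_target'] = daysAboveTarget - df.Date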

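I have overused iloc and loc here, so let me explain. The variable dayAboveTarget holds the date on which the price closes above the target. The first iloc subsets the DataFrame to the current and future rows only, the first loc finds the rows that actually match, and the second iloc takes only the first match. The except branch covers the cases where the price never goes above the target.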
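
To make the chain easier to follow, here is a minimal sketch that splits it into separate steps on toy data. The DataFrame toy and its values are hypothetical, and the rows are assumed to be sorted by ascending date.

```python
import pandas as pd

# Hypothetical toy data (not from the question), sorted by ascending date.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-04-15", "2019-04-16", "2019-04-17", "2019-04-18"]),
    "Target1": [102.0, 103.5, 101.0, 110.0],
    "Close": [100.0, 101.0, 103.0, 104.0],
})

i = 1
target = toy["Target1"].iloc[i]
future = toy.iloc[i:]                          # rows from position i onwards
hits = future.loc[future["Close"] > target]    # rows whose Close exceeds row i's target
first_hit = hits["Date"].iloc[0]               # earliest such date
days = (first_hit - toy["Date"].iloc[i]).days

# For the last row nothing closes above 110, so the selection is empty
# and calling .iloc[0] on it would raise IndexError.
no_hits = toy.iloc[3:].loc[toy["Close"].iloc[3:] > toy["Target1"].iloc[3]]
```

Here row 1 has a target of 103.5, and the first later close above it is 104.0 on 2019-04-18, so days is 2.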

Answer 2: (score: 0)

Maybe a faster solution:

import pandas as pd

# df is your DataFrame
df["Date"] = pd.to_datetime(df["Date"])
# reset the index so that x.name below is the row's position after sorting
df = df.sort_values("Date").reset_index(drop=True)

def days_to_hit(x, no_hit_default=None):
    # first row at or after x whose Close reaches x's target, else the default
    return next(
        ((df["Date"].iloc[j+x.name] - x["Date"]).days
         for j in range(len(df)-x.name)
         if df["Close"].iloc[j+x.name] >= x["Target1"]), no_hit_default)

df["days_to_hit_target"] = df.apply(days_to_hit, axis=1)
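
As a variation, the same next(generator, default) idea can be wrapped in a function that takes the DataFrame as an argument instead of relying on a global df. This is only a sketch: the helper name add_days_to_hit and the toy data are hypothetical, and it assumes the columns are named Date, Target1 and Close as in the question.

```python
import pandas as pd

def add_days_to_hit(frame, no_hit_default=None):
    """Return a date-sorted copy of frame with a days_to_hit_target column (hypothetical helper)."""
    out = frame.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values("Date").reset_index(drop=True)
    dates = out["Date"].tolist()
    closes = out["Close"].tolist()
    targets = out["Target1"].tolist()

    result = []
    for pos in range(len(out)):
        # next() yields the first match from the generator, or the default if there is none
        result.append(next(
            ((dates[j] - dates[pos]).days
             for j in range(pos, len(out))
             if closes[j] >= targets[pos]),
            no_hit_default))
    # object dtype keeps None as-is instead of converting it to NaN
    out["days_to_hit_target"] = pd.Series(result, index=out.index, dtype="object")
    return out

# Hypothetical toy data, deliberately not sorted by date.
toy = pd.DataFrame({
    "Date": ["2019-04-17", "2019-04-15", "2019-04-16"],
    "Target1": [110.0, 101.0, 102.5],
    "Close": [103.0, 100.0, 101.0],
})
result = add_days_to_hit(toy)
```

After sorting, the targets 101.0 and 102.5 are each reached by the close of the following day, while 110.0 is never reached and gets the default.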