熊猫合并名称和最近的日期

时间:2014-07-07 15:39:01

标签: python date pandas merge

我正在尝试在名称和最近的日期合并两个数据帧(WRT左手数据帧)。在我的研究中,我发现了一个类似的问题here,但它也没有考虑到这个名字。从上面的问题看来似乎没有办法用merge做这个,但我看不到另一种方法来做两个不使用pandas merge函数的参数连接。

有没有办法合并?如果不是这样做的合适方式是什么?

我会发布我尝试过的副本,但这是在日期上完全合并的尝试,这是行不通的。最重要的一行是我制作data3数据帧的最后一行。

data=pd.read_csv("edgar14Afacts.csv", parse_dates={"dater": [2]}, infer_datetime_format=True)
data2=pd.read_csv("sdcmergersdata.csv", parse_dates={"dater": [17]}, infer_datetime_format=True)
list(data2.columns.values)

data2.rename(columns=lambda x: x.replace('\r\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\r', ''), inplace=True)
data2=data2.rename(columns = {'Acquiror Name':'name'})
data2=data2.rename(columns = {'dater':'date'})
data=data.rename(columns = {'dater':'date'})

list(data2.columns.values)

data["name"]=data['name'].map(str.lower)
data2["name"]=data2['name'].map(str.lower)
data2['date'].fillna(method='pad')
data['namer1']=data['name']
data['dater1']=data['date']
data2['namer2']=data2['name']
data2['dater2']=data2['date']

print data.head()
print data2.head()
data['name'] = data['name'].map(lambda x: str(x)[:4])
data2['name'] = data2['name'].map(lambda x: str(x)[:4])

data3 = pd.merge(data, data2, how='left', on=['date','name'])
data3.to_csv("check.csv")

3 个答案:

答案 0 :(得分:5)

我也很乐意看到你提出的最终解决方案,以了解它最终是如何被淘汰的。

要找到最接近的日期,您可以做的一件事就是计算第一个DataFrame中每个日期与第二个DataFrame中的日期之间的天数。然后,您可以使用np.argmin检索时间增量最小的日期。

例如:

<强>设置

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO

数据

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

查看数据

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-02       29
2 2014-01-03        5
3 2014-01-04       73
4 2014-01-05       40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-08       29
2 2014-01-15        5
3 2014-01-22       73
4 2014-01-29       40

Func查找到给定日期的最近日期

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the number of days in its time delta
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas)
    res = {"closest_date": time_series.ix[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                          find_closest_date, args=[df2.timepoint])
df1.head(10)

   timepoint  measure    closest  days_bt_x_and_y
0 2014-01-01       78 2014-01-01           0 days
1 2014-01-02       29 2014-01-01           1 days
2 2014-01-03        5 2014-01-01           2 days
3 2014-01-04       73 2014-01-01           3 days
4 2014-01-05       40 2014-01-08           3 days
5 2014-01-06       45 2014-01-08           2 days
6 2014-01-07       48 2014-01-08           1 days
7 2014-01-08        2 2014-01-08           0 days
8 2014-01-09       96 2014-01-08           1 days
9 2014-01-10       82 2014-01-08           2 days

合并新closest日期列上的两个数据框

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y'
]

df3 = df3.ix[:, colorder]
df3

   timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48

答案 1 :(得分:4)

这太迟了,但希望它对新的求职者有帮助。我answered a similar question here

在熊猫中使用了一种新方法:

pandas.merge_asof()

您感兴趣的参数将是directiontoleranceleft_onright_on

建立@hernamesbarbara答案和数据:

数据

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

解决方案

import pandas as pd
from pandas import read_csv
from io import StringIO

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])



df1['timepoint'] = pd.to_datetime(df1['timepoint'])
df2['timepoint'] = pd.to_datetime(df2['timepoint'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['timepoint']
df2.index = df2['timepoint']
# the magic happens below, check the direction and tolerance arguments
# if you want you can make a maximum tolerance on which to merge data
tol = pd.Timedelta('3 day')
df3 = pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

输出

df3.head()

    timepoint_x measure_x   timepoint_y measure_y
timepoint               
2014-01-01  2014-01-01  78  2014-01-01  78
2014-01-02  2014-01-02  29  2014-01-01  78
2014-01-03  2014-01-03  5   2014-01-01  78
2014-01-04  2014-01-04  73  2014-01-01  78
2014-01-05  2014-01-05  40  2014-01-08  29

答案 2 :(得分:1)

hernamesbarbara代码的一小部分内容

def find_closest_date(timepoint, time_series, add_time_delta_column=True, mode="abs"):
    """takes a pd.Timestamp() instance and a pd.Series with dates in it
    calcs the delta between `timepoint` and each date in `time_series`
    returns the closest date and optionally the number of days in its time delta

    Parameters
    ----------
    mode: "abs" (default), "left", "right"
        closest datetime by abs, at left, at right

    References
    ----------
    .. [1] http://stackoverflow.com/a/25962323/716469
    """
    deltas = time_series - timepoint

    idx_closest_date = None
    if mode == "abs":
        idx_closest_date = np.argmin(abs(deltas))
    elif mode == "left":
        deltas_ = deltas[deltas <= pd.Timedelta('0 days 00:00:00.0')]
        if len(deltas_):
            idx_closest_date = np.argmax(deltas_)
    elif mode == "right":
        deltas_ = deltas[deltas >= pd.Timedelta('0 days 00:00:00.0')]
        if len(deltas_):
            idx_closest_date = np.argmin(deltas_)
    else:
        raise Exception("Mode is incorrect")

    if idx_closest_date is not None:
        closest_date = time_series.ix[idx_closest_date]
        if add_time_delta_column:
            closest_delta = deltas[idx_closest_date]
    else:
        closest_date = pd.NaT
        if add_time_delta_column:
            closest_delta = pd.Timedelta(pd.NaT)

    res = {"closest_date": closest_date}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = closest_delta
        idx.append('closest_delta')

    return pd.Series(res, index=idx)