Finding the length of a dataframe over a moving time period for a specific category

Time: 2017-11-06 05:34:25

Tags: python pandas date

I have two DataFrames:

df:

    Name    Date_1  Date_2
0   Alan    2013-06-21  2013-06-26
1   Bob     2011-01-29  2011-02-01
2   Chris   2010-11-15  2010-11-17
3   Bob     2016-03-14  2016-03-16
4   Doug    2011-03-07  2011-03-10
5   Elijah  2011-02-24  2011-03-01
6   Bob     2011-01-03  2011-01-13
7   Bob     2011-02-07  2011-02-25
8   Frank   2014-07-21  2014-07-23
9   Chris   2011-02-18  2011-02-22
10  Doug    2010-09-13  2010-09-17
11  Chris   2011-01-15  2011-01-19
12  George  2010-06-29  2010-06-30

and df1:

        Date        Name    Period
12971   2015-08-18  Alan    2015-08-16
12972   2015-08-19  Alan    2015-08-17
12973   2015-08-20  Alan    2015-08-18
12974   2015-08-21  Alan    2015-08-19
12975   2015-08-22  Alan    2015-08-20
12976   2015-08-23  Alan    2015-08-21
12977   2015-08-24  Alan    2015-08-22
12978   2015-08-25  Alan    2015-08-23
12979   2015-08-26  Alan    2015-08-24
12980   2015-08-27  Alan    2015-08-25
12981   2015-08-28  Alan    2015-08-26
12982   2015-08-29  Alan    2015-08-27

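Both dataframes have thousands of rows; this is just a sample. For each row of df1, I want to count the rows in df that have the same Name and for which Date_1 is earlier than Period and Date_2 is later than Date.

I have done the following, but it is very slow: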

df1['Volume'] = df1.apply(lambda x: len(df[(df['Name'] == x['Name']) & (
df['Date_1'] < x['Period']) & (df['Date_2'] > x['Date'])]), axis=1)

Please let me know if you have any suggestions.

1 Answer:

Answer 0: (score: 0)

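Since the sample you posted contains no overlapping rows, I had to change the tables you provided slightly (one extra row is added to df1 below). I assume you want to do the following: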

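The pattern is simple: first do an outer join on the two tables, then pivot the result, and finally left join the pivot onto the dataframe you are interested in. This should be faster than your approach; however, it may use more memory.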
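The three steps can be sketched on a toy input as follows. The frames `intervals` and `windows` and all of their values are invented for illustration, and the sketch uses `groupby(...).size()` where the answer uses a pivot table; both count the matching pairs per group.

```python
import pandas as pd

# Toy stand-ins for df (intervals) and df1 (windows); all values are invented.
intervals = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date_1": pd.to_datetime(["2013-06-21", "2013-07-01", "2011-01-29"]),
    "Date_2": pd.to_datetime(["2013-06-26", "2013-07-05", "2011-02-01"]),
})
windows = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date": pd.to_datetime(["2013-06-24", "2015-08-18", "2011-01-30"]),
    "Period": pd.to_datetime(["2013-06-25", "2015-08-16", "2011-01-31"]),
})

# Step 1: pair every window with every interval that has the same Name.
pairs = windows.merge(intervals, on="Name", how="outer")

# Step 2: keep the pairs that satisfy both date conditions, then count per window.
hits = pairs[(pairs["Date_1"] < pairs["Period"]) & (pairs["Date_2"] > pairs["Date"])]
counts = (hits.groupby(["Name", "Date", "Period"]).size()
              .rename("Volume").reset_index())

# Step 3: left join the counts back; windows without any match get 0.
result = windows.merge(counts, on=["Name", "Date", "Period"], how="left")
result["Volume"] = result["Volume"].fillna(0).astype(int)
print(result)
```

The intermediate `pairs` frame is where the extra memory goes: it holds one row per (window, interval) combination that shares a Name.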

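Parsing the data

The first thing to do is some basic parsing of the data you provided to get it into dataframes (you can skip this step; I include it so the result can be reproduced):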

import pandas as pd
from io import StringIO
import re

# First set tables
table = """    Name    Date_1  Date_2
0   Alan    2013-06-21  2013-06-26
1   Bob     2011-01-29  2011-02-01
2   Chris   2010-11-15  2010-11-17
3   Bob     2016-03-14  2016-03-16
4   Doug    2011-03-07  2011-03-10
5   Elijah  2011-02-24  2011-03-01
6   Bob     2011-01-03  2011-01-13
7   Bob     2011-02-07  2011-02-25
8   Frank   2014-07-21  2014-07-23
9   Chris   2011-02-18  2011-02-22
10  Doug    2010-09-13  2010-09-17
11  Chris   2011-01-15  2011-01-19
12  George  2010-06-29  2010-06-30"""

table2 = """        Date        Name    Period
12971   2015-08-18  Alan    2015-08-16
12972   2015-08-19  Alan    2015-08-17
12973   2015-08-20  Alan    2015-08-18
12974   2015-08-21  Alan    2015-08-19
12975   2015-08-22  Alan    2015-08-20
12976   2015-08-23  Alan    2015-08-21
12977   2015-08-24  Alan    2015-08-22
12978   2015-08-25  Alan    2015-08-23
12979   2015-08-26  Alan    2015-08-24
12980   2015-08-27  Alan    2015-08-25
12981   2015-08-28  Alan    2015-08-26
12982   2015-08-29  Alan    2015-08-27
12983   2013-06-24  Alan    2013-06-25"""

# Prepare tables in format that makes date lookups easier
series = pd.read_csv(StringIO(table))['    Name    Date_1  Date_2'].apply(lambda x: ["".join(re.findall("[A-Za-z0-9-]",i)) for i in x.split(" ") if re.findall("[A-Za-z0-9-]",i) != []])
df = pd.DataFrame(series.values.tolist(), columns = ["index", "Name", "Date_1","Date_2"])
df["Date_1"] = pd.to_datetime(df["Date_1"])
df["Date_2"] = pd.to_datetime(df["Date_2"])

series = pd.read_csv(StringIO(table2))['        Date        Name    Period'].apply(lambda x: ["".join(re.findall("[A-Za-z0-9-]",i)) for i in x.split(" ") if re.findall("[A-Za-z0-9-]",i) != []])
df1 = pd.DataFrame(series.values.tolist(), columns = ["index", "Date", "Name","Period"])
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Period"] = pd.to_datetime(df1["Period"])
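As a shorter alternative to the regex-based parsing, whitespace-aligned text like this can usually be read directly with `read_csv` and a whitespace separator. This is a minimal sketch on a three-row excerpt (the name `sample` is made up here); unlike the code above, the leading row numbers become the dataframe index rather than an `index` column.

```python
import pandas as pd
from io import StringIO

# Three-row excerpt of the first table from the question.
sample = """    Name    Date_1  Date_2
0   Alan    2013-06-21  2013-06-26
1   Bob     2011-01-29  2011-02-01
2   Chris   2010-11-15  2010-11-17"""

# The header has one field fewer than the data rows, so pandas uses the
# first (unnamed) column as the index.
parsed = pd.read_csv(StringIO(sample), sep=r"\s+")

# Convert the date strings to datetime64 so they can be compared later.
for col in ["Date_1", "Date_2"]:
    parsed[col] = pd.to_datetime(parsed[col])

print(parsed.dtypes)
```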

Outer join

An outer join on the Name column is straightforward:

outer = pd.merge(df1,df, on="Name",how="outer")
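Because the join pairs every row of one table with every row of the other that shares a Name, the result can be much larger than either input. The size can be estimated from the per-name row counts before merging; the sketch below uses invented names in place of the real `df1` and `df`.

```python
import pandas as pd

# Invented Name columns standing in for df1["Name"] and df["Name"].
left_names = pd.Series(["Alan", "Alan", "Alan", "Bob"], name="Name")
right_names = pd.Series(["Alan", "Alan", "Bob", "Chris"], name="Name")

left_counts = left_names.value_counts()
right_counts = right_names.value_counts()

# A name present on both sides contributes (left rows) x (right rows) pairs.
matched = left_counts.mul(right_counts, fill_value=0)

# In an outer join, names present on one side only still keep their own rows.
left_only = left_counts[~left_counts.index.isin(right_counts.index)].sum()
right_only = right_counts[~right_counts.index.isin(left_counts.index)].sum()
expected_rows = int(matched.sum() + left_only + right_only)

# Compare the estimate with the actual merge.
merged = pd.merge(left_names.to_frame(), right_names.to_frame(),
                  on="Name", how="outer")
print(expected_rows, len(merged))
```

If the estimate is too large for memory, one option is to run the merge separately for groups of names and concatenate the per-group counts.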

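Merging onto df1 (the assumed requirement)

First pivot the filtered data on Name, Date and Period and count the rows in each group. Then call reset_index and merge with the original table; where the lookup finds no match, I assume a value of 0.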

# Pivot table
pivot = outer[(outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])].pivot_table(index=["Name","Date","Period"],
                                                                                    values= ["Date_1"],
                                                                                    aggfunc="count").reset_index()
# Rename columns for merging
pivot.columns = ["Name","Date","Period","Volume"]

pd.merge(df1,pivot, how = "left", on=["Name","Date","Period"]).fillna(0)
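A quick way to gain confidence in the merge-based count is to compare it with the row-wise `apply` from the question on a small input. The data below is invented for the check. Note one assumption: the (Name, Date, Period) combinations in `df1` are unique; duplicated combinations would be counted together by the pivot and the two results would then differ.

```python
import pandas as pd

# Small invented inputs with the same columns as df and df1.
df = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date_1": pd.to_datetime(["2013-06-21", "2013-06-23", "2011-01-29"]),
    "Date_2": pd.to_datetime(["2013-06-26", "2013-06-30", "2011-02-01"]),
})
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2013-06-24", "2015-08-18", "2011-01-30", "2011-03-08"]),
    "Name": ["Alan", "Alan", "Bob", "Doug"],
    "Period": pd.to_datetime(["2013-06-25", "2015-08-16", "2011-01-31", "2011-03-09"]),
})

# Row-wise count, as in the question.
slow = df1.apply(lambda x: len(df[(df["Name"] == x["Name"])
                                  & (df["Date_1"] < x["Period"])
                                  & (df["Date_2"] > x["Date"])]), axis=1)

# Merge, filter, pivot and merge back, as in the answer.
outer = pd.merge(df1, df, on="Name", how="outer")
mask = (outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])
pivot = outer[mask].pivot_table(index=["Name", "Date", "Period"],
                                values="Date_1", aggfunc="count").reset_index()
pivot.columns = ["Name", "Date", "Period", "Volume"]
fast = pd.merge(df1, pivot, how="left",
                on=["Name", "Date", "Period"])["Volume"].fillna(0).astype(int)

# Both approaches should give the same count for every row of df1.
assert slow.tolist() == fast.tolist()
```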

Merging onto df (for completeness)

# Pivot table
pivot = outer[(outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])].pivot_table(index=["Name"],
                                                                                    values= ["Date"],
                                                                                    aggfunc="count").reset_index()
# Rename columns for merging
pivot.columns = ["Name","Volume"]

pd.merge(df,pivot, how = "left", on="Name").fillna(0)