I have two DataFrames:
df:
Name Date_1 Date_2
0 Alan 2013-06-21 2013-06-26
1 Bob 2011-01-29 2011-02-01
2 Chris 2010-11-15 2010-11-17
3 Bob 2016-03-14 2016-03-16
4 Doug 2011-03-07 2011-03-10
5 Elijah 2011-02-24 2011-03-01
6 Bob 2011-01-03 2011-01-13
7 Bob 2011-02-07 2011-02-25
8 Frank 2014-07-21 2014-07-23
9 Chris 2011-02-18 2011-02-22
10 Doug 2010-09-13 2010-09-17
11 Chris 2011-01-15 2011-01-19
12 George 2010-06-29 2010-06-30
and df1:
Date Name Period
12971 2015-08-18 Alan 2015-08-16
12972 2015-08-19 Alan 2015-08-17
12973 2015-08-20 Alan 2015-08-18
12974 2015-08-21 Alan 2015-08-19
12975 2015-08-22 Alan 2015-08-20
12976 2015-08-23 Alan 2015-08-21
12977 2015-08-24 Alan 2015-08-22
12978 2015-08-25 Alan 2015-08-23
12979 2015-08-26 Alan 2015-08-24
12980 2015-08-27 Alan 2015-08-25
12981 2015-08-28 Alan 2015-08-26
12982 2015-08-29 Alan 2015-08-27
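Both DataFrames have thousands of rows; this is just a sample. For each row of df1, I want to count the rows in df that have the same Name and where Date_1 is less than Period and Date_2 is greater than Date.
I have done the following, but it is very slow: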
df1['Volume'] = df1.apply(lambda x: len(df[(df['Name'] == x['Name']) & (
df['Date_1'] < x['Period']) & (df['Date_2'] > x['Date'])]), axis=1)
Please let me know if you have any suggestions.
Answer 0 (score: 0)
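Since your current sample contains no overlapping rows, I had to change the tables you provided slightly (I added one row, 12983, to df1). I assume you want to do the following.
The pattern is simple: first do an outer join of the two tables, then pivot the result, and finally left-join the pivot onto the DataFrame you are interested in. This should be faster than your approach, but it may use more memory, because the join creates one row for every pair of rows that share a Name.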
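The three steps can be sketched compactly as follows. This is a minimal illustration and not the exact code of this answer: it uses a left join and groupby().size() in place of the outer join and pivot_table, the helper name count_overlaps and the temporary column _row are hypothetical, and the small tables are made up.

```python
import pandas as pd

# Tiny made-up tables with the same columns as in the question.
df = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date_1": pd.to_datetime(["2013-06-21", "2013-06-01", "2011-01-29"]),
    "Date_2": pd.to_datetime(["2013-06-26", "2013-06-30", "2011-02-01"]),
})
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2013-06-24", "2015-08-18"]),
    "Name": ["Alan", "Alan"],
    "Period": pd.to_datetime(["2013-06-25", "2015-08-16"]),
})


def count_overlaps(df1, df):
    """For each df1 row, count df rows with the same Name where
    Date_1 < Period and Date_2 > Date."""
    # Tag every df1 row with a positional id so counts can be mapped back.
    left = df1.assign(_row=range(len(df1)))
    # Step 1: pair each df1 row with every df row of the same Name.
    joined = left.merge(df, on="Name", how="left")
    # Step 2: keep only the pairs that satisfy both date conditions.
    hits = joined[(joined["Date_1"] < joined["Period"])
                  & (joined["Date_2"] > joined["Date"])]
    # Step 3: count the hits per df1 row; rows without any hit get 0.
    counts = hits.groupby("_row").size().reindex(range(len(df1)), fill_value=0)
    return pd.Series(counts.to_numpy(), index=df1.index, name="Volume")


volume = count_overlaps(df1, df)
print(volume.tolist())
```

Counting per positional row id rather than per (Name, Date, Period) keeps the counts correct even if df1 contains repeated rows.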
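The first step is some basic parsing of the data you provided to load it into DataFrames (you can skip this; I include it so the example can be reproduced):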
import pandas as pd
from io import StringIO
import re
# First set tables
table = """ Name Date_1 Date_2
0 Alan 2013-06-21 2013-06-26
1 Bob 2011-01-29 2011-02-01
2 Chris 2010-11-15 2010-11-17
3 Bob 2016-03-14 2016-03-16
4 Doug 2011-03-07 2011-03-10
5 Elijah 2011-02-24 2011-03-01
6 Bob 2011-01-03 2011-01-13
7 Bob 2011-02-07 2011-02-25
8 Frank 2014-07-21 2014-07-23
9 Chris 2011-02-18 2011-02-22
10 Doug 2010-09-13 2010-09-17
11 Chris 2011-01-15 2011-01-19
12 George 2010-06-29 2010-06-30"""
table2 = """ Date Name Period
12971 2015-08-18 Alan 2015-08-16
12972 2015-08-19 Alan 2015-08-17
12973 2015-08-20 Alan 2015-08-18
12974 2015-08-21 Alan 2015-08-19
12975 2015-08-22 Alan 2015-08-20
12976 2015-08-23 Alan 2015-08-21
12977 2015-08-24 Alan 2015-08-22
12978 2015-08-25 Alan 2015-08-23
12979 2015-08-26 Alan 2015-08-24
12980 2015-08-27 Alan 2015-08-25
12981 2015-08-28 Alan 2015-08-26
12982 2015-08-29 Alan 2015-08-27
12983 2013-06-24 Alan 2013-06-25"""
# Prepare tables in format that makes date lookups easier
series = pd.read_csv(StringIO(table))[' Name Date_1 Date_2'].apply(lambda x: ["".join(re.findall("[A-Za-z0-9-]",i)) for i in x.split(" ") if re.findall("[A-Za-z0-9-]",i) != []])
df = pd.DataFrame(series.values.tolist(), columns = ["index", "Name", "Date_1","Date_2"])
df["Date_1"] = pd.to_datetime(df["Date_1"])
df["Date_2"] = pd.to_datetime(df["Date_2"])
series = pd.read_csv(StringIO(table2))[' Date Name Period'].apply(lambda x: ["".join(re.findall("[A-Za-z0-9-]",i)) for i in x.split(" ") if re.findall("[A-Za-z0-9-]",i) != []])
df1 = pd.DataFrame(series.values.tolist(), columns = ["index", "Date", "Name","Period"])
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Period"] = pd.to_datetime(df1["Period"])
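If the text is whitespace-separated as shown, a shorter way to parse it might be to let read_csv split on whitespace directly. The sketch below uses a two-row excerpt of the first table; the variable names sample and parsed are hypothetical, and I have not checked it against the exact spacing of the full tables above.

```python
import pandas as pd
from io import StringIO

# Two-row excerpt of the first table, same layout as above.
sample = """Name Date_1 Date_2
0 Alan 2013-06-21 2013-06-26
1 Bob 2011-01-29 2011-02-01"""

# sep=r"\s+" splits on any run of whitespace. The header has one field fewer
# than the data rows, so pandas uses the first column as the row index.
parsed = pd.read_csv(StringIO(sample), sep=r"\s+",
                     parse_dates=["Date_1", "Date_2"])
print(parsed.dtypes)
```

Note that this variant keeps the leading numbers as the DataFrame index instead of a separate "index" column.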
The outer join on the Name column is simple:
outer = pd.merge(df1,df, on="Name",how="outer")
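To get a feel for the memory cost of this join, it may help to estimate its size up front: every Name contributes (rows in the left table) × (rows in the right table) pairs, and a Name found on only one side keeps its rows unpaired. The following sketch uses made-up tables; left, right and expected_rows are hypothetical names.

```python
import pandas as pd

# Made-up tables: Alan appears 3 times on the left and 2 times on the right.
left = pd.DataFrame({"Name": ["Alan"] * 3 + ["Bob"], "Date": range(4)})
right = pd.DataFrame({"Name": ["Alan"] * 2 + ["Chris"], "Date_1": range(3)})

outer = pd.merge(left, right, on="Name", how="outer")

# Predict the size of the join from the per-Name row counts.
n_left = left["Name"].value_counts()
n_right = right["Name"].value_counts()
pairs = (n_left * n_right).dropna()  # Names present on both sides
only_left = n_left[~n_left.index.isin(n_right.index)].sum()
only_right = n_right[~n_right.index.isin(n_left.index)].sum()
expected_rows = int(pairs.sum() + only_left + only_right)

# Alan: 3 * 2 = 6 pairs; Bob and Chris contribute one unpaired row each.
print(len(outer), expected_rows)
```

If a few Names dominate both tables, the joined table can become much larger than either input, which is where the extra memory goes.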
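Then simply pivot the data by Name, Date and Period and count the matching rows. After that, call reset_index and merge the result with the original table; where no match is found, I assume a value of 0.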
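# Pivot table: count the matching df rows for each (Name, Date, Period) in df1
pivot = outer[(outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])].pivot_table(index=["Name","Date","Period"],
values= ["Date_1"],
aggfunc="count").reset_index()
# Rename columns for merging; use a flat list so the columns stay a plain Index
pivot.columns = ["Name","Date","Period","Volume"]
# Left join back onto df1; rows without a match get Volume 0
pd.merge(df1,pivot, how = "left", on=["Name","Date","Period"]).fillna(0)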
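A quick way to gain confidence in the join-and-pivot result is to compare it with the row-by-row apply from the question on a small sample. The sketch below does that on made-up tables; the names slow, fast and mask are hypothetical, and it assumes that (Name, Date, Period) identifies each df1 row uniquely.

```python
import pandas as pd

# Made-up tables with the same columns as in the question.
df = pd.DataFrame({
    "Name": ["Alan", "Alan", "Bob"],
    "Date_1": pd.to_datetime(["2013-06-21", "2013-06-01", "2011-01-29"]),
    "Date_2": pd.to_datetime(["2013-06-26", "2013-06-30", "2011-02-01"]),
})
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2013-06-24", "2015-08-18", "2011-01-30"]),
    "Name": ["Alan", "Alan", "Bob"],
    "Period": pd.to_datetime(["2013-06-25", "2015-08-16", "2011-01-31"]),
})

# Row-by-row count, as in the question.
slow = df1.apply(
    lambda x: len(df[(df["Name"] == x["Name"])
                     & (df["Date_1"] < x["Period"])
                     & (df["Date_2"] > x["Date"])]),
    axis=1,
)

# Join, filter, pivot and merge back, following the steps of this answer.
outer = pd.merge(df1, df, on="Name", how="outer")
mask = (outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])
pivot = outer[mask].pivot_table(index=["Name", "Date", "Period"],
                                values="Date_1", aggfunc="count").reset_index()
pivot = pivot.rename(columns={"Date_1": "Volume"})
fast = pd.merge(df1, pivot, how="left", on=["Name", "Date", "Period"])
fast["Volume"] = fast["Volume"].fillna(0).astype(int)

# True when both approaches agree on every row.
print((slow.to_numpy() == fast["Volume"].to_numpy()).all())
```

If df1 can contain repeated (Name, Date, Period) rows, the pivot would count their pairs together, so the counts should be checked separately in that case.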
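# Pivot table (per-Name variant): count the matching pairs for each Name
pivot = outer[(outer["Date_1"] < outer["Period"]) & (outer["Date_2"] > outer["Date"])].pivot_table(index=["Name"],
values= ["Date"],
aggfunc="count").reset_index()
# Rename columns for merging; use a flat list so the columns stay a plain Index
pivot.columns = ["Name","Volume"]
# Left join onto df; Names without a match get Volume 0
pd.merge(df,pivot, how = "left", on="Name").fillna(0)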