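I have a set of records in Python, each with an ID, at least one attribute, and a date range. I would like code that, for each ID, combines all records whose attributes match and whose date ranges have no gap between them.
By "no gap in the date ranges" I mean that the end date of one record is greater than or equal to the start date of the next record for that ID.
For example, a record with ID "10", start date "2016-01-01" and end date "2017-01-01" can be merged with another record with that ID whose start date is "2017-01-01" and end date is "2018-01-01", but it cannot be merged with a record that starts on "2017-01-10", because there would be a gap from 2017-01-01 to 2017-01-09.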
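The rule above boils down to a single comparison between neighbouring records. Here is a minimal sketch of it, using the dates from the example; the helper name can_merge is made up for illustration and is not part of any library.

```python
import pandas as pd

# End date of the earlier record, and two candidate start dates for the next one.
prev_end = pd.Timestamp("2017-01-01")
touching_start = pd.Timestamp("2017-01-01")
late_start = pd.Timestamp("2017-01-10")


def can_merge(prev_end, next_start):
    """Hypothetical helper: True when the next record starts on or before the previous end."""
    return next_start <= prev_end


print(can_merge(prev_end, touching_start))  # the ranges touch, so they can be merged
print(can_merge(prev_end, late_start))      # there is a gap, so they cannot
```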
Here are some examples.
Have:
FruitID,FruitType,StartDate,EndDate
1,Apple,2015-01-01,2016-01-01
1,Apple,2016-01-01,2017-01-01
1,Apple,2017-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2017-01-01
2,Orange,2017-01-01,2018-01-01
3,Banana,2015-01-01,2016-01-01
3,Banana,2016-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01
Want:
FruitID,FruitType,NewStartDate,NewEndDate
1,Apple,2015-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2018-01-01
3,Banana,2015-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01
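My current solution is below. It gives the results I'm looking for, but performance does not seem good for large datasets. Also, my impression is that you generally want to avoid iterating over the individual rows of a DataFrame where possible. Any help you can provide is much appreciated!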
import pandas as pd
from dateutil.parser import parse

# DataFrame.from_items was removed in pandas 1.0; a plain dict is used instead.
have = pd.DataFrame({
    'FruitID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'FruitType': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange',
                  'Banana', 'Banana', 'Blueberry', 'Mango', 'Kiwi', 'Mango'],
    'StartDate': [parse(x) for x in ['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31',
                                     '2017-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01',
                                     '2016-09-15', '2017-01-01']],
    'EndDate': [parse(x) for x in ['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
                                   '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01',
                                   '2017-01-01', '2018-01-01']],
})
# sort_values returns a new DataFrame, so the result has to be assigned back.
have = have.sort_values(['FruitID', 'StartDate'])

rowlist = []
fruit_cur_row = None
for row in have.itertuples():
    if fruit_cur_row is None:
        # First record: start the first merged row.
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)
    elif fruit_cur_row['FruitID'] != row.FruitID or fruit_cur_row['FruitType'] != row.FruitType:
        # A change of ID or attribute closes the current merged row.
        rowlist.append(fruit_cur_row)
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)
    elif row.StartDate <= fruit_cur_row['NewEndDate']:
        # No gap: extend the current merged row.
        fruit_cur_row['NewEndDate'] = max(fruit_cur_row['NewEndDate'], row.EndDate)
    else:
        # Gap: close the current merged row and start a new one.
        rowlist.append(fruit_cur_row)
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)
rowlist.append(fruit_cur_row)
have_mrg = pd.DataFrame(rowlist)
print(have_mrg[['FruitID', 'FruitType', 'NewStartDate', 'NewEndDate']])
Answer 0 (score: 1)
Use a nested groupby approach:
def merge_dates(grp):
    # Find contiguous date groups, and get the first/last start/end date for each group.
    dt_groups = (grp['StartDate'] != grp['EndDate'].shift()).cumsum()
    return grp.groupby(dt_groups).agg({'StartDate': 'first', 'EndDate': 'last'})

# Perform a groupby and apply the merge_dates function, followed by formatting.
df = df.groupby(['FruitID', 'FruitType']).apply(merge_dates)
df = df.reset_index().drop('level_2', axis=1)
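To see how the shift/cumsum comparison inside merge_dates labels contiguous runs, here is a small sketch on the three Orange rows from the question. The variable names orange and starts_new_run are only for illustration.

```python
import pandas as pd

# The three Orange records from the question, already sorted by StartDate.
orange = pd.DataFrame({
    "StartDate": pd.to_datetime(["2015-01-01", "2016-05-31", "2017-01-01"]),
    "EndDate": pd.to_datetime(["2016-01-01", "2017-01-01", "2018-01-01"]),
})

# A row starts a new run when its StartDate differs from the previous row's EndDate.
# The first row is compared against NaT, which always counts as "different".
starts_new_run = orange["StartDate"] != orange["EndDate"].shift()

# The running sum turns those flags into one label per contiguous run.
dt_groups = starts_new_run.cumsum()
print(dt_groups.tolist())  # [1, 2, 2]
```

The second and third rows share a label because the third starts exactly where the second ends, so they are aggregated into one span.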
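Note that this method assumes your dates are already sorted. If they are not, you will first need to call sort_values on the DataFrame. This method may not work if you have nested date spans.
Resulting output: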
FruitID FruitType StartDate EndDate
0 1 Apple 2015-01-01 2018-01-01
1 2 Orange 2015-01-01 2016-01-01
2 2 Orange 2016-05-31 2018-01-01
3 3 Banana 2015-01-01 2017-01-01
4 3 Blueberry 2017-01-01 2018-01-01
5 4 Kiwi 2016-09-15 2017-01-01
6 4 Mango 2015-01-01 2016-01-01
7 4 Mango 2017-01-01 2018-01-01
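For the two caveats noted above (unsorted input and nested spans), one possible variation is to sort first and compare each start date with the running maximum of the earlier end dates. This is a sketch under my own assumptions, not part of the answer: the function name merge_spans and the helper column span are made up, and like the answer it groups by FruitID and FruitType.

```python
import pandas as pd


def merge_spans(df):
    """Merge touching, overlapping, or nested spans per (FruitID, FruitType)."""
    keys = ["FruitID", "FruitType"]
    df = df.sort_values(keys + ["StartDate"]).reset_index(drop=True)
    # Latest EndDate seen among the earlier rows of the same key (NaT for the first row).
    prev_max_end = df.groupby(keys)["EndDate"].transform(lambda s: s.cummax().shift())
    # A row opens a new span if it is the first of its key, or if it starts
    # after every earlier row of that key has ended.
    new_span = prev_max_end.isna() | (df["StartDate"] > prev_max_end)
    df = df.assign(span=new_span.cumsum())
    merged = df.groupby(keys + ["span"]).agg(
        NewStartDate=("StartDate", "min"),
        NewEndDate=("EndDate", "max"),
    )
    return merged.reset_index().drop(columns="span")


# Made-up sample: two short spans nested inside a long one, then a gap.
sample = pd.DataFrame({
    "FruitID": [1, 1, 1, 1],
    "FruitType": ["Apple"] * 4,
    "StartDate": pd.to_datetime(["2019-01-01", "2016-01-01", "2017-01-01", "2015-01-01"]),
    "EndDate": pd.to_datetime(["2020-01-01", "2016-06-01", "2017-06-01", "2018-01-01"]),
})
result = merge_spans(sample)
print(result)
```

On this sample the three rows that fall inside 2015-01-01 to 2018-01-01 collapse into one span, and the 2019 row stays separate.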
Answer 1 (score: 0)
Here is what I came up with...
import numpy as np
import pandas as pd

# `data` is the DataFrame recreated in the explanation below.
df = pd.melt(data, id_vars=['FruitID', 'FruitType'], var_name='WhichDate', value_name='Date')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['FruitType', 'Date']).drop_duplicates(['FruitType', 'Date'])
df = df.assign(Counter = np.nan)
StartDf = df[df['WhichDate']=='StartDate']
StartDf = StartDf.assign(Counter=np.arange(len(StartDf)))
df[df['WhichDate']=='StartDate'] = StartDf
# fillna(method='ffill') is deprecated in recent pandas; ffill() does the same forward fill.
df = df.ffill()
s = df.groupby(['Counter', 'FruitID', 'FruitType']).agg({'Date': ['min', 'max']}).rename(columns={'min': 'NewStartDate', 'max': 'NewEndDate'})
s.columns = s.columns.droplevel()
s = s.reset_index()
del s['Counter']
s = s.sort_values(['FruitID', 'FruitType']).reset_index(drop=True)
Which outputs...
FruitID FruitType NewStartDate NewEndDate
0 1 Apple 2015-01-01 2018-01-01
1 2 Orange 2015-01-01 2016-01-01
2 2 Orange 2016-05-31 2018-01-01
3 3 Banana 2015-01-01 2017-01-01
4 3 Blueberry 2017-01-01 2018-01-01
5 4 Kiwi 2016-09-15 2017-01-01
6 4 Mango 2015-01-01 2016-01-01
7 4 Mango 2017-01-01 2018-01-01
Explanation
First, I recreated your DataFrame.
data = pd.DataFrame({'FruitID' : [1,1,1,2,2,2,3,3,3,4,4,4],
'FruitType': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana',
'Blueberry', 'Mango', 'Kiwi',
'Mango'],
'StartDate': ['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31',
'2017-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01',
'2016-09-15', '2017-01-01'],
'EndDate' : ['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
'2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
'2018-01-01']})
Next, I used the pandas melt function to reshape the data into long format.
df = pd.melt(data, id_vars=['FruitID', 'FruitType'], var_name='WhichDate', value_name='Date')
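As a quick illustration of what that reshape does, here is a sketch on a single made-up record (the names tiny and long_df are only for this example): each input row becomes two rows, one per date column.

```python
import pandas as pd

# One record in wide format: both dates sit on the same row.
tiny = pd.DataFrame({
    "FruitID": [1],
    "FruitType": ["Apple"],
    "StartDate": ["2015-01-01"],
    "EndDate": ["2016-01-01"],
})

# Long format: one row per date, with WhichDate recording the source column.
long_df = pd.melt(tiny, id_vars=["FruitID", "FruitType"],
                  var_name="WhichDate", value_name="Date")
print(long_df)
```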
Then I sorted by date within each fruit type and dropped any rows with duplicate dates.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['FruitType', 'Date']).drop_duplicates(['FruitType', 'Date'])
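I created a helper column that numbers each row holding a StartDate. We need to do this before performing the groupby. The fill step then carries each number forward, which partitions the rows into groups.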
df = df.assign(Counter = np.nan)
StartDf = df[df['WhichDate']=='StartDate']
StartDf = StartDf.assign(Counter=np.arange(len(StartDf)))
df[df['WhichDate']=='StartDate'] = StartDf
df = df.ffill()  # same forward fill; fillna(method='ffill') is deprecated in recent pandas
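The counter-and-fill idea can be tried on a tiny long-format frame. This sketch assumes made-up data and uses .loc with Series.ffill() rather than the whole-frame assignment above, so treat it as an equivalent illustration, not the answer's exact code.

```python
import numpy as np
import pandas as pd

# Long-format rows after sorting and de-duplicating: two spans' worth of dates.
long_df = pd.DataFrame({
    "WhichDate": ["StartDate", "EndDate", "StartDate", "EndDate"],
    "Date": pd.to_datetime(["2015-01-01", "2016-01-01", "2016-05-31", "2018-01-01"]),
})

# Number only the rows that hold a surviving StartDate...
is_start = long_df["WhichDate"] == "StartDate"
long_df["Counter"] = np.nan
long_df.loc[is_start, "Counter"] = np.arange(is_start.sum())

# ...then carry each number forward so the following EndDate rows share it.
long_df["Counter"] = long_df["Counter"].ffill()
print(long_df["Counter"].tolist())  # [0.0, 0.0, 1.0, 1.0]
```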
Finally, we use groupby and agg to get the min and max dates for each partition.
s = df.groupby(['Counter', 'FruitID', 'FruitType']).agg({'Date': ['min', 'max']}).rename(columns={'min': 'NewStartDate', 'max': 'NewEndDate'})
s.columns = s.columns.droplevel()
s = s.reset_index()
del s['Counter']
s = s.sort_values(['FruitID', 'FruitType']).reset_index(drop=True)
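If your pandas version supports named aggregation (0.25 or later, as far as I know), the agg, rename, and droplevel steps can probably be collapsed into one call. A sketch on made-up partitioned data, with long_df standing in for the melted frame:

```python
import pandas as pd

# Two partitions (Counter 0 and 1) for the same fruit, as produced by the fill step.
long_df = pd.DataFrame({
    "Counter": [0, 0, 1, 1],
    "FruitID": [2, 2, 2, 2],
    "FruitType": ["Orange"] * 4,
    "Date": pd.to_datetime(["2015-01-01", "2016-01-01", "2016-05-31", "2018-01-01"]),
})

# Named aggregation produces flat, already-renamed columns.
s = (
    long_df.groupby(["Counter", "FruitID", "FruitType"])
    .agg(NewStartDate=("Date", "min"), NewEndDate=("Date", "max"))
    .reset_index()
    .drop(columns="Counter")
)
print(s)
```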
Answer 2 (score: 0)
Nice answer, root. I have modified your function so that it now also works when the date ranges intersect. Maybe it will help someone.
def merge_dates(grp):
    dt_groups = (grp['StartDate'] > grp['EndDate'].shift()).cumsum()
    grouped = grp.groupby(dt_groups).agg({'StartDate': 'min', 'EndDate': 'max'})
    if len(grp) == len(grouped):
        return grouped
    else:
        return merge_dates(grouped)
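Here is a small usage sketch of that function on a single group, with made-up spans that are sorted by StartDate and include two ranges nested inside a longer one. The function body is repeated so the snippet runs on its own.

```python
import pandas as pd


def merge_dates(grp):
    # A row opens a new group only when it starts after the previous row ends.
    dt_groups = (grp['StartDate'] > grp['EndDate'].shift()).cumsum()
    grouped = grp.groupby(dt_groups).agg({'StartDate': 'min', 'EndDate': 'max'})
    if len(grp) == len(grouped):
        # Nothing was merged in this pass, so the spans are final.
        return grouped
    else:
        # Something was merged; run another pass on the merged spans.
        return merge_dates(grouped)


# Made-up spans for one (FruitID, FruitType) group.
spans = pd.DataFrame({
    'StartDate': pd.to_datetime(['2015-01-01', '2016-01-01', '2017-01-01', '2019-01-01']),
    'EndDate': pd.to_datetime(['2018-01-01', '2016-06-01', '2017-06-01', '2020-01-01']),
})
merged = merge_dates(spans)
print(merged)
```

The first pass leaves the 2017 span separate because it starts after the 2016 span ends; the recursive pass then folds it into the long 2015 to 2018 span, which is why the function calls itself until the row count stops changing.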