I have two DataFrames that I want to merge in a specific way.
Dataframe 1: SF
CustomerID CaseID Datetime
1 1 09-09-2018 18:02:00
1 2 16-09-2018 09:06:00
2 3 18-09-2018 11:07:00
Dataframe 2: apps
CustomerID Text Datetime
1 Hello, I want to know.. 09-09-2018 18:00:00
1 Thank you for your question,.. 09-09-2018 18:05:00
1 Oke thank you 10-09-2018 18:20:00
1 Hello, can you help me with... 16-09-2018 09:05:00
1 Yes,.... 16-09-2018 09:10:00
2 Hi, where can I find.... 18-09-2018 11:06:00
2 Hi, you can find it... 18-09-2018 11:09:00
2 Thanks! 18-09-2018 11:15:00
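The key shared by the two DataFrames is CustomerID. However, I want to attach each text message to the correct CaseID, so that I get the following result: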
Dataframe 3: combined
CustomerID Text Datetime CaseID
1 Hello, I want to know.. 09-09-2018 18:00:00 1
1 Thank you for your question,.. 09-09-2018 18:05:00 1
1 Oke thank you 10-09-2018 18:20:00 1
1 Hello, can you help me with... 16-09-2018 09:05:00 2
1 Yes,.... 16-09-2018 09:10:00 2
2 Hi, where can I find.... 18-09-2018 11:06:00 3
2 Hi, you can find it... 18-09-2018 11:09:00 3
2 Thanks! 18-09-2018 11:15:00 3
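I think it could be done like this (pseudocode): for each CaseID of a CustomerID, take all text messages from the apps DataFrame up to the next CaseID of that CustomerID. But I don't know how to write this in Python.
I hope someone can help me.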
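Answer 0: (score: 1)
It looks to me like you want to check whether a date falls between two of the dates in the SF DataFrame. However, this row surprises me: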
1 Hello, can you help me with... 16-09-2018 09:05:00 2
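Its date lies between the dates of case_id 1 and case_id 2, yet it is assigned case_id 2. If the following is what you are looking for, it may help. First, I recreated your DataFrames.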
import pandas as pd

# Create DataFrames as in the example. The dates are day-first (DD-MM-YYYY),
# so pass an explicit format instead of relying on pandas' guess.
date_format = '%d-%m-%Y %H:%M:%S'
sf_dates = pd.to_datetime(
    ['09-09-2018 18:02:00', '16-09-2018 09:06:00', '18-09-2018 11:07:00'],
    format=date_format)
# The third message is 10 September 2018, as in the question's table
apps_date = pd.to_datetime(
    ['09-09-2018 18:00:00', '09-09-2018 18:05:00', '10-09-2018 18:20:00',
     '16-09-2018 09:05:00', '16-09-2018 09:10:00', '18-09-2018 11:06:00',
     '18-09-2018 11:09:00', '18-09-2018 11:15:00'],
    format=date_format)
apps = pd.DataFrame({'date': apps_date, 'customer_id': [1, 1, 1, 1, 1, 2, 2, 2]})
case = pd.DataFrame({'date': sf_dates, 'case_id': [1, 2, 3]})
Then I identified the first edge case: every date before case_id 2 should get case_id 1:
edge_case_1 = (case.iloc[case.date.idxmin()].case_id,
               case.iloc[case.date.idxmin()+1].date)
The second edge case is that every date after case_id 3 should get case_id 3:
edge_case_2 = (case.iloc[case.date.idxmax()].case_id, case.iloc[case.date.idxmax()].date)
Next, build a dictionary that gives each remaining case_id a start and an end date, meaning that a date between the two belongs to that case_id:
date_ranges = {case.loc[x, 'case_id']: (case.iloc[x].date, case.iloc[x+1].date)
               for x in range(1, len(case)-1)}
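The same start/end pairs can also be laid out as columns, which makes the intervals easier to inspect. This is only a minimal sketch, assuming the case data from above; the names `windows`, `start`, and `end` are illustrative and not part of the answer's code.

```python
import pandas as pd

# Same case data as in the answer, parsed day-first
case = pd.DataFrame({
    'date': pd.to_datetime(
        ['09-09-2018 18:02:00', '16-09-2018 09:06:00', '18-09-2018 11:07:00'],
        format='%d-%m-%Y %H:%M:%S'),
    'case_id': [1, 2, 3],
})

# Hypothetical helper table: each case starts at its own date and ends
# where the next case starts (NaT for the last case means open-ended)
windows = case.sort_values('date').reset_index(drop=True)
windows['start'] = windows['date']
windows['end'] = windows['date'].shift(-1)
print(windows[['case_id', 'start', 'end']])
```

Unlike edge case 1 above, this layout does not extend the first case backwards in time; it only shows where each case begins and where the next one takes over.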
Finally, use apply to run it over the DataFrame:
def return_case_id(row, date_ranges, edge_case_1, edge_case_2):
    # Edge case 1: dates before the second case's date get the first case_id
    if row.date < edge_case_1[1]:
        return edge_case_1[0]
    # Edge case 2: dates after the last case's date get the last case_id
    elif row.date > edge_case_2[1]:
        return edge_case_2[0]
    # All other cases (strictly between two dates)
    else:
        for case_id, dates in date_ranges.items():
            if dates[0] < row.date < dates[1]:
                return case_id
    # Nothing matched (e.g. a date exactly equal to a boundary): return None,
    # which makes it easy to check whether everything was assigned as intended
    return None

apps['case_id'] = apps.apply(lambda row: return_case_id(row, date_ranges,
                                                        edge_case_1,
                                                        edge_case_2), axis=1)
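The row this answer finds surprising (a message one minute before the timestamp of case 2, yet assigned to case 2) is consistent with matching each message to the case whose timestamp is nearest, per customer. The sketch below tries that reading with `pandas.merge_asof` and `direction='nearest'`. It is an assumption about the intended rule, not something the question states, and it would assign a message late in one case's window to the following case if that case's timestamp were closer.

```python
import pandas as pd

FMT = '%d-%m-%Y %H:%M:%S'  # day-first, as in the question's tables

sf = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'CaseID': [1, 2, 3],
    'Datetime': pd.to_datetime(
        ['09-09-2018 18:02:00', '16-09-2018 09:06:00', '18-09-2018 11:07:00'],
        format=FMT),
})
apps = pd.DataFrame({
    'CustomerID': [1, 1, 1, 1, 1, 2, 2, 2],
    'Text': ['Hello, I want to know..', 'Thank you for your question,..',
             'Oke thank you', 'Hello, can you help me with...', 'Yes,....',
             'Hi, where can I find....', 'Hi, you can find it...', 'Thanks!'],
    'Datetime': pd.to_datetime(
        ['09-09-2018 18:00:00', '09-09-2018 18:05:00', '10-09-2018 18:20:00',
         '16-09-2018 09:05:00', '16-09-2018 09:10:00', '18-09-2018 11:06:00',
         '18-09-2018 11:09:00', '18-09-2018 11:15:00'],
        format=FMT),
})

# merge_asof needs both frames sorted by the "on" key; "by" restricts
# matches to rows with the same CustomerID
combined = pd.merge_asof(
    apps.sort_values('Datetime'),
    sf.sort_values('Datetime'),
    on='Datetime',
    by='CustomerID',
    direction='nearest',
)
print(combined)  # CaseID column: 1, 1, 1, 2, 2, 3, 3, 3
```

On the question's sample data this reproduces the CaseID column of the desired "combined" table; whether "nearest" is the right rule for real data is for the asker to judge.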
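Answer 1: (score: 0)
Thank you for your answer, Maarten, but that is not quite what I meant. I have now done it as shown below, and it works. However, there are two problems: 1. It is very slow (I have to run it on apps data with 20k records). 2. I got stuck when trying to turn it into a function.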
import pandas as pd
import numpy as np

# Create DataFrames as in the example. The dates are day-first (DD-MM-YYYY),
# so pass an explicit format instead of relying on pandas' guess.
date_format = '%d-%m-%Y %H:%M:%S'
sf_dates = pd.to_datetime(
    ['09-09-2018 18:00:00', '16-09-2018 09:05:00', '18-09-2018 11:10:00'],
    format=date_format)
# The third message is 10 September 2018, as in the question's table
apps_date = pd.to_datetime(
    ['09-09-2018 18:00:00', '09-09-2018 18:05:00', '10-09-2018 18:20:00',
     '16-09-2018 09:05:00', '16-09-2018 09:10:00', '18-09-2018 11:08:00',
     '18-09-2018 11:09:00', '18-09-2018 11:15:00', '22-09-2018 11:15:00'],
    format=date_format)
apps = pd.DataFrame({'date': apps_date, 'customer_id': [1, 1, 1, 1, 1, 2, 2, 2, 4]})
case = pd.DataFrame({'date': sf_dates, 'case_id': [1, 2, 3], 'customer_id': [1, 1, 2]})
# Assign a case_id to each row of the apps data
apps['case_id'] = np.nan  # add a new, empty 'case_id' column to the apps DataFrame
for index_apps, row_apps in apps.iterrows():  # iterate over each row in the apps data
    # Sub-selection of the case data whose customer_id matches the customer_id of this apps row
    case_selection = case[case.customer_id == row_apps['customer_id']]
    case_selection = case_selection.reset_index(drop=True)  # reset the index so it runs 0, 1, 2, ...
    index_case_selection = 0
    while index_case_selection >= 0:
        if case_selection.empty:
            # The customer_id exists only in the apps DataFrame, not in the case DataFrame,
            # so no case_id can be assigned to this apps row and it stays NaN
            index_case_selection = -1
        elif (index_case_selection == len(case_selection.index) - 1) and (apps.date[index_apps] >= case_selection.date[index_case_selection]):
            # At the last row of case_selection (also the first row if there is only one) and the
            # apps date is later than or equal to the case date: assign that case_id to the apps row
            apps.loc[index_apps, 'case_id'] = case_selection.case_id[index_case_selection]
            index_case_selection = -1
        elif index_case_selection == len(case_selection.index) - 1:
            # At the last row of case_selection and the apps date is earlier than the case date:
            # no case_id can be assigned to this apps row, so it stays NaN
            index_case_selection = -1
        elif (apps.date[index_apps] >= case_selection.date[index_case_selection]) and (apps.date[index_apps] < case_selection.date[index_case_selection + 1]):
            # The apps date is later than or equal to this case date and earlier than the next case date
            apps.loc[index_apps, 'case_id'] = case_selection.case_id[index_case_selection]
            index_case_selection = -1
        else:
            index_case_selection += 1
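The rule implemented by the loop above (take the most recent case of the same customer whose date is not later than the message date, otherwise NaN) looks like a backward as-of join. The sketch below expresses it with `pandas.merge_asof`, wrapped in a function, as one possible way to address both the speed and the "turn it into a function" problems. The function name `assign_case_ids` is made up for illustration, and it has not been timed against the 20k-row data.

```python
import pandas as pd

FMT = '%d-%m-%Y %H:%M:%S'  # day-first dates

case = pd.DataFrame({
    'date': pd.to_datetime(
        ['09-09-2018 18:00:00', '16-09-2018 09:05:00', '18-09-2018 11:10:00'],
        format=FMT),
    'case_id': [1, 2, 3],
    'customer_id': [1, 1, 2],
})
apps = pd.DataFrame({
    'date': pd.to_datetime(
        ['09-09-2018 18:00:00', '09-09-2018 18:05:00', '10-09-2018 18:20:00',
         '16-09-2018 09:05:00', '16-09-2018 09:10:00', '18-09-2018 11:08:00',
         '18-09-2018 11:09:00', '18-09-2018 11:15:00', '22-09-2018 11:15:00'],
        format=FMT),
    'customer_id': [1, 1, 1, 1, 1, 2, 2, 2, 4],
})


def assign_case_ids(apps, case):
    """Return a copy of apps with a case_id column (hypothetical helper).

    Each row gets the case_id of the latest case of the same customer
    whose date is <= the row's date; rows without such a case get NaN.
    """
    merged = pd.merge_asof(
        apps.reset_index().sort_values('date'),  # keep the original row labels
        case.sort_values('date'),
        on='date',
        by='customer_id',
        direction='backward',  # exact matches are allowed by default
    )
    # Restore the original row order and index
    return merged.set_index('index').sort_index().rename_axis(None)


result = assign_case_ids(apps, case)
print(result)  # case_id column: 1, 1, 1, 2, 2, NaN, NaN, 3, NaN
```

Because the join is done in one vectorized call instead of a Python loop over rows, it avoids re-filtering the case DataFrame for every message.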