Question

I'm a bit frustrated. I've spent the whole day on one problem without getting any real results. I'm working in Python and using Pandas to handle the data.

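What I want to achieve is to sum up each type of interaction based on the customer's previous interactions. The timestamp of an interaction should be earlier than the timestamp of the survey. Ideally, I would like to sum up the customer's interactions over a limited period, for example the last 5 years.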

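The first dataframe contains a customer ID, the customer's segment at the time of the survey (e.g. 1 for "happy", 2 for "sad"), and a timestamp recording when the segment was captured, i.e. the time of the survey.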

import pandas as pd

#Generic example
customers = pd.DataFrame({"customerID":[1,1,1,2,2,3,4,4],"customerSeg":[1,2,2,1,2,3,3,3],"timestamp":['1999-01-01','2000-01-01','2000-06-01','2001-01-01','2003-01-01','1999-01-01','2005-01-01','2008-01-01']})

customers

which produces the following:

customerID	customerSeg	timestamp
1	1	1999-01-01
1	1	2000-01-01
1	1	2000-06-01
2	2	2001-01-01
2	2	2003-01-01
3	3	1999-01-01
4	4	2005-01-01
4	4	2008-01-01

The other dataframe contains the interactions with the customer, e.g. a service and a phone call.

interactions = pd.DataFrame({"customerID":[1,1,1,1,2,2,2,2,4,4,4],"timestamp":['1999-07-01','1999-11-01','2000-03-01','2001-04-01','2000-12-01','2002-01-01','2004-03-01','2004-05-01','2000-01-01','2004-01-01','2009-01-01'],"service":[1,0,1,0,1,0,1,1,0,1,1],"phonecall":[0,1,1,1,1,1,0,1,1,0,1]})
interactions

Output:

customerID	timestamp	service	phonecall
1	1999-07-01	1	0
1	1999-11-01	0	1
1	2000-03-01	1	1
1	2001-04-01	0	1
2	2000-12-01	1	1
2	2002-01-01	0	1
2	2004-03-01	1	0
2	2004-05-01	1	1
4	2000-01-01	0	1
4	2004-01-01	1	0
4	2009-01-01	1	1

The desired result, counting all previous interactions (ideally, I would like only the last 5 years):

customerID	customerSeg	timestamp	service	phonecall
1	1	1999-01-01	0	0
1	1	2000-01-01	1	1
1	1	2000-06-01	2	2
2	2	2001-01-01	1	1
2	2	2003-01-01	1	2
3	3	1999-01-01	0	0
4	4	2005-01-01	1	1
4	4	2008-01-01	1	1

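I have tried just about everything I could think of, so I would really appreciate some input. I use Pandas and Python almost exclusively, since that is what I'm most familiar with, but also because I need to read a csv file of the customer segments.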

Answer 1

I think transforming the data takes a few steps.

First, we convert the timestamp columns in both dataframes to datetime, so that we can compute the required interval and do the comparisons:

customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])

After that, we create a new column holding the start date of the interval (here, 5 years before the timestamp):

customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
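As a small aside on why a calendar offset is a reasonable choice here: `pd.DateOffset(years=5)` moves back by calendar years, whereas a fixed-length `pd.Timedelta` ignores leap days. A minimal sketch, using an arbitrary date that is not taken from the question's data (the names `survey`, `start` and `approx` are only for illustration):

```python
import pandas as pd

# Illustrative date only; it does not come from the question's data.
survey = pd.Timestamp("2000-06-01")

# DateOffset shifts by calendar years, so the month and day are preserved.
start = survey - pd.DateOffset(years=5)

# A fixed span of 5 * 365 days skips the leap days of 1996 and 2000,
# so it lands two days later than the calendar-based start date.
approx = survey - pd.Timedelta(days=5 * 365)

print(start.date(), approx.date())
```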

Now we join the customers dataframe with the interactions dataframe on customerID:

result = customers.merge(interactions, on='customerID', how='outer')

This produces

    customerID  customerSeg timestamp_x start_date timestamp_y  service  phonecall
0            1            1  1999-01-01 1994-01-01  1999-07-01      1.0        0.0
1            1            1  1999-01-01 1994-01-01  1999-11-01      0.0        1.0
2            1            1  1999-01-01 1994-01-01  2000-03-01      1.0        1.0
3            1            1  1999-01-01 1994-01-01  2001-04-01      0.0        1.0
4            1            2  2000-01-01 1995-01-01  1999-07-01      1.0        0.0
5            1            2  2000-01-01 1995-01-01  1999-11-01      0.0        1.0
6            1            2  2000-01-01 1995-01-01  2000-03-01      1.0        1.0
7            1            2  2000-01-01 1995-01-01  2001-04-01      0.0        1.0
...

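Now comes the part where the condition is evaluated. We want the service and phonecall values to count only in rows that meet the condition (timestamp_y lies in the interval between start_date and timestamp_x, both bounds inclusive), so we replace the others with zero: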

result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)

Finally, we group the dataframe and sum the service and phonecall interactions:

result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()

Result:

                                    service  phonecall
customerID timestamp_x customerSeg                    
1          1999-01-01  1                0.0        0.0
           2000-01-01  2                1.0        1.0
           2000-06-01  2                2.0        2.0
2          2001-01-01  1                1.0        1.0
           2003-01-01  2                1.0        2.0
3          1999-01-01  3                0.0        0.0
4          2005-01-01  3                1.0        1.0
           2008-01-01  3                1.0        0.0

(Note that the customerSeg data in the example code doesn't quite seem to match the data in the table.)
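To get back a flat table in the layout the question asks for, the steps can be wrapped in a function and the grouped result reset to ordinary columns. The following is only a sketch: the helper name `summarize_window` and its `years` parameter are hypothetical, and it uses a left join (rather than the outer join above) on the assumption that only survey rows are of interest.

```python
import pandas as pd

def summarize_window(customers, interactions, years=5):
    """Hypothetical helper: per survey row, sum the interactions of the prior `years`."""
    cust = customers.copy()
    inter = interactions.copy()
    cust["timestamp"] = pd.to_datetime(cust["timestamp"])
    inter["timestamp"] = pd.to_datetime(inter["timestamp"])
    cust["start_date"] = cust["timestamp"] - pd.DateOffset(years=years)

    # Left join keeps every survey row, even for customers with no interactions.
    merged = cust.merge(inter, on="customerID", how="left")
    in_window = merged["timestamp_y"].between(merged["start_date"], merged["timestamp_x"])
    value_cols = ["service", "phonecall"]
    for col in value_cols:
        merged[col] = merged[col].where(in_window, 0)

    return (
        merged.groupby(["customerID", "customerSeg", "timestamp_x"])[value_cols]
        .sum()
        .astype(int)
        .reset_index()
        .rename(columns={"timestamp_x": "timestamp"})
    )

customers = pd.DataFrame({
    "customerID": [1, 1, 1, 2, 2, 3, 4, 4],
    "customerSeg": [1, 2, 2, 1, 2, 3, 3, 3],
    "timestamp": ["1999-01-01", "2000-01-01", "2000-06-01", "2001-01-01",
                  "2003-01-01", "1999-01-01", "2005-01-01", "2008-01-01"],
})
interactions = pd.DataFrame({
    "customerID": [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4],
    "timestamp": ["1999-07-01", "1999-11-01", "2000-03-01", "2001-04-01",
                  "2000-12-01", "2002-01-01", "2004-03-01", "2004-05-01",
                  "2000-01-01", "2004-01-01", "2009-01-01"],
    "service": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "phonecall": [0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
})

last_5_years = summarize_window(customers, interactions, years=5)
# A window longer than the data's time span approximates "all previous interactions".
all_previous = summarize_window(customers, interactions, years=100)
print(last_5_years)
```

With a 5-year window, the last row for customer 4 counts only the 2004 interaction, which is why it differs from the "all previous interactions" table in the question.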

Conditional groupby on dates using pandas
