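I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I want to add a column for each purchase showing how much time has passed since the previous purchase, grouped by customer. I don't know where it is getting the differences from, but they are far too large (even if they are in seconds).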
CSV:
Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
Python:
import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
.diff()
.fillna('-')
)
print(data)
Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000
Desired output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                   -
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
Answer 0 (score: 4)
I think you can pass the parse_dates parameter to read_csv to parse the dates on load, then call sort_values, and finally use groupby with diff:
import pandas as pd
import io
temp=u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""
# after testing, replace io.StringIO(temp) with the filename
data = pd.read_csv(io.StringIO(temp), parse_dates=['Purchase Date'])
# sort after parsing, so the dates are ordered chronologically rather than as strings
data.sort_values(by=['Customer Id', 'Purchase Date'], inplace=True)
data['Purchase Difference'] = data.groupby(['Customer Id'])['Purchase Date'].diff()
print(data)
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                 NaT
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                 NaT
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
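For what it's worth, the huge numbers in the question look like the same gaps expressed in nanoseconds, which is the resolution pandas uses for timedeltas. My guess (an assumption, I have not checked this across pandas versions) is that .fillna('-') pushes the column out of the timedelta dtype, so the raw integers end up being displayed. A quick sanity check on three of the values copied from the question:

```python
import pandas as pd

# Raw values copied from the question's output, assumed to be nanoseconds.
raw_ns = [2678400000000000, 2419200000000000, 3024000000000000]

# pd.to_timedelta interprets plain integers using the given unit.
as_timedelta = pd.to_timedelta(raw_ns, unit="ns")

# Each element is a pd.Timedelta; .days gives the whole-day component.
days = [td.days for td in as_timedelta]
print(days)  # [31, 28, 35]
```

Those match the 31, 28 and 35 days in the desired output, so the diff itself was already correct and only its representation was off.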
Answer 1 (score: 4)
Once the column has been converted to timestamps, you can apply diff to the Purchase Date column.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# Blank for a customer's first purchase (NaT) and for same-day repeats;
# otherwise "N day" or "N days".
df['Purchase Difference'] = [
    '' if pd.isnull(n) or n.days == 0
    else str(n.days) + ' day' + ('s' if n.days > 1 else '')
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()
]
>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15
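If you also want the gap as a number (for filtering or averaging) alongside the '-' placeholder from the question, one possible variant is to go through .dt.days. This is only a sketch: the 'Days Since Last' column name is made up for illustration, and the explicit format='%m/%d/%Y' assumes the dates are month-first, as in the sample data.

```python
import io
import pandas as pd

csv_text = """Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""

df = pd.read_csv(io.StringIO(csv_text))
# Assumption: month/day/year, as in the sample rows.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'], format='%m/%d/%Y')
df = df.sort_values(['Customer Id', 'Purchase Date'])

# Timedelta since the previous purchase of the same customer (NaT for the first one).
gap = df.groupby('Customer Id')['Purchase Date'].diff()

# Hypothetical helper column: whole days as a number, NaN for the first purchase.
df['Days Since Last'] = gap.dt.days

# Display column: '-' for the first purchase, otherwise "N days".
df['Purchase Difference'] = [
    '-' if pd.isna(d) else '{} days'.format(int(d))
    for d in df['Days Since Last']
]
print(df)
```

Keeping the numeric column separate from the display string means you can still compute on it later, e.g. df.groupby('Customer Id')['Days Since Last'].mean().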