I have the following DataFrame:
sale_id created_at
1 2016-05-28T05:53:31.042Z
2 2016-05-30T12:50:58.184Z
3 2016-05-23T10:22:18.858Z
4 2016-05-27T09:20:15.158Z
5 2016-05-21T08:30:17.337Z
6 2016-05-28T07:41:14.361Z
I need to add a year_week column that contains the year and week number of each row's created_at value:
sale_id created_at year_week
1 2016-05-28T05:53:31.042Z 2016-21
2 2016-05-30T12:50:58.184Z 2016-22
3 2016-05-23T10:22:18.858Z 2016-21
4 2016-05-27T09:20:15.158Z 2016-21
5 2016-05-21T08:30:17.337Z 2016-20
6 2016-05-28T07:41:14.361Z 2016-21
I would prefer a solution that can easily be ported to PySpark.
Answer 0 (score: 5)
You can use strftime:
# convert first if the column's dtype is not already datetime
df.created_at = pd.to_datetime(df.created_at)
df['year_week'] = df.created_at.dt.strftime('%Y-%U')
print(df)
sale_id created_at year_week
0 1 2016-05-28 05:53:31.042 2016-21
1 2 2016-05-30 12:50:58.184 2016-22
2 3 2016-05-23 10:22:18.858 2016-21
3 4 2016-05-27 09:20:15.158 2016-21
4 5 2016-05-21 08:30:17.337 2016-20
5 6 2016-05-28 07:41:14.361 2016-21
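# Alternative: concatenate the year and the week number as strings.
# The original used .dt.week, which was removed in pandas 2.0; .dt.isocalendar().week replaces it.
df['year_week'] = (df.created_at.dt.year.astype(str) + '-' +
                   df.created_at.dt.isocalendar().week.astype(str))
print(df)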
sale_id created_at year_week
0 1 2016-05-28 05:53:31.042 2016-21
1 2 2016-05-30 12:50:58.184 2016-22
2 3 2016-05-23 10:22:18.858 2016-21
3 4 2016-05-27 09:20:15.158 2016-21
4 5 2016-05-21 08:30:17.337 2016-20
5 6 2016-05-28 07:41:14.361 2016-21
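One caveat: the two snippets above agree on the sample data, but they do not use the same week definition. `%U` counts weeks starting on Sunday, while the pandas week accessor returns the ISO 8601 week, which starts on Monday. A minimal sketch with the standard library only shows where they diverge. The helper names and the two extra dates are made up for illustration and are not part of the question's data.

```python
from datetime import date


def year_week_sunday(d):
    """Year and week number with Sunday as the first day of the week (strftime %U)."""
    return d.strftime('%Y-%U')


def year_week_iso(d):
    """ISO 8601 year and week number (weeks start on Monday)."""
    iso_year, iso_week, _ = d.isocalendar()
    return f'{iso_year}-{iso_week:02d}'


saturday = date(2016, 5, 28)  # a date from the sample data
sunday = date(2016, 5, 29)    # hypothetical extra date: a Sunday
new_year = date(2016, 1, 1)   # hypothetical extra date: a Friday

for d in (saturday, sunday, new_year):
    print(d, year_week_sunday(d), year_week_iso(d))
```

The Saturday gets `2016-21` under both definitions. The Sunday is already week 22 under `%U` but still week 21 in ISO terms, and 1 January 2016 is `2016-00` under `%U` but belongs to ISO week 53 of 2015. Whichever definition you pick, use the same one on the pandas and PySpark sides.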
Answer 1 (score: 4)
UPDATE: PySpark DataFrame solution:
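from pyspark.sql.functions import date_format
# Pass the column expression directly to withColumn (no df.select) and keep the returned DataFrame.
# Note: Spark 3.0+ rejects week-based pattern letters such as 'w'; weekofyear() is the usual substitute there.
df = df.withColumn('year_week', date_format('created_at', 'yyyy-w'))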
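Old pandas solution:
Try this:
# The original used .dt.weekofyear, which was removed in pandas 2.0; .dt.isocalendar().week replaces it.
df['year_week'] = df.created_at.dt.year.astype(str) + '-' + df.created_at.dt.isocalendar().week.astype(str)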
In [29]: df
Out[29]:
sale_id created_at year_week
0 1 2016-05-28 05:53:31.042 2016-21
1 2 2016-05-30 12:50:58.184 2016-22
2 3 2016-05-23 10:22:18.858 2016-21
3 4 2016-05-27 09:20:15.158 2016-21
4 5 2016-05-21 08:30:17.337 2016-20
5 6 2016-05-28 07:41:14.361 2016-21
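For reference, here is a self-contained sketch of the same idea that rebuilds the question's data and runs on recent pandas. It assumes pandas 1.1 or later, where `Series.dt.isocalendar()` is available. Two choices go beyond the one-liner above and are my own: the year is also taken from the ISO calendar, so dates around New Year get a matching year and week, and the week is zero-padded to two digits.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6],
    'created_at': [
        '2016-05-28T05:53:31.042Z',
        '2016-05-30T12:50:58.184Z',
        '2016-05-23T10:22:18.858Z',
        '2016-05-27T09:20:15.158Z',
        '2016-05-21T08:30:17.337Z',
        '2016-05-28T07:41:14.361Z',
    ],
})

# Parse the ISO 8601 strings into datetimes.
df['created_at'] = pd.to_datetime(df['created_at'])

# isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns.
iso = df['created_at'].dt.isocalendar()
df['year_week'] = (iso['year'].astype(str) + '-' +
                   iso['week'].astype(str).str.zfill(2))

print(df[['sale_id', 'year_week']])
```

On this data the result matches the expected column from the question, because none of the six timestamps falls on a Sunday or near a year boundary.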
Timing against a 600K-row DataFrame:
In [33]: df = pd.concat([df] * 10**5, ignore_index=True)
In [34]: %timeit df.created_at.dt.strftime('%Y-%U')
1 loop, best of 3: 16.1 s per loop
In [35]: %timeit df.created_at.dt.year.astype(str) + '-' + df.created_at.dt.weekofyear.astype(str)
1 loop, best of 3: 7.43 s per loop
In [43]: %timeit df.created_at.dt.year.astype(str) + '-' + df.created_at.dt.week.astype(str)
1 loop, best of 3: 7.45 s per loop
In [36]: df.shape
Out[36]: (600000, 2)