import pandas as pd
import io
import numpy as np
import datetime
data = """
date id
2015-10-31 50230
2015-10-31 48646
2015-10-31 48748
2015-10-31 46992
2015-11-01 46491
2015-11-01 45347
2015-11-01 45681
2015-11-01 46430
"""
df = pd.read_csv(io.StringIO(data), delimiter=r'\s+', index_col=False, parse_dates=['date'])
df2 = pd.DataFrame(index=df.index)
df2['Check'] = np.where(datetime.datetime.strftime(df['date'],'%B')=='October',0,1)
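I have this example that I am working with. What df2['Check'] is meant to do is test whether the month of df['date'] is 'October'; if so I assign 0, otherwise 1. np.where works fine with other conditions, but strftime does not accept a Series, which leads to this error: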
Traceback (most recent call last):
File "C:/Users/Leb/Desktop/Python/test2.py", line 22, in <module>
df2['Check'] = np.where(datetime.datetime.strftime(df['date'],'%B')=='October',0,1)
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'Series'
Looping takes a very long time on my actual data, which has about 1M rows. How can I do this efficiently?
df2['Check'] should look like this:
Check
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
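Answer 0 (score: 3)
Here is a slightly simpler version that uses the month attribute of the datetime object. If the month equals 10, just map the True/False values to the 0/1 pair you want: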
df2['Check']=df.date.apply(lambda x: x.month==10).map({True:0,False:1})
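To try the mapping above on the question's sample data, it can be run as a small self-contained script. This is only a sketch: the frame is rebuilt inline instead of being parsed from the text block, and `is_october` and `check` are illustrative names that do not appear in the original posts.

```python
import pandas as pd

# Rebuild the question's sample data directly (assumption: same dates and ids).
df = pd.DataFrame({
    'date': pd.to_datetime(['2015-10-31'] * 4 + ['2015-11-01'] * 4),
    'id': [50230, 48646, 48748, 46992, 46491, 45347, 45681, 46430],
})

# apply() hands each element to the lambda as a pandas Timestamp,
# which exposes a .month attribute (1-12).
is_october = df['date'].apply(lambda x: x.month == 10)

# Map True (October) to 0 and False (any other month) to 1.
check = is_october.map({True: 0, False: 1})
print(check.tolist())
```

Because `apply` calls the lambda once per row in Python, this approach is convenient but not the fastest option on large frames.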
Answer 1 (score: 0)
@ako's answer is on the money, but based on the comments from @Kartik and @EdChum, this is what I came up with:
import pandas as pd
import io
import numpy as np
data = """
2015-10-31 50230
2015-10-31 48646
2015-10-31 48748
2015-10-31 46992
2015-11-01 46491
2015-11-01 45347
2015-11-01 45681
2015-11-01 46430
"""
df = pd.read_csv(io.StringIO(data*125000), delimiter=r'\s+', index_col=False, names=['date','id'], parse_dates=['date'])
df2 = pd.DataFrame(index=df.index)
df.shape
(1125000, 2)
%timeit df2['Check']=df.date.apply(lambda x: x.month==10).map({True:0,False:1})
1 loops, best of 3: 2.56 s per loop
%timeit df2['Check'] = np.where(df['date'].dt.month==10,0,1)
10 loops, best of 3: 80.5 ms per loop
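Outside IPython, the vectorized approach timed above can be sketched as a plain script. The sample frame is rebuilt inline, and `check_alt` and `month_names` are illustrative names, not part of the original answer. The last line shows `Series.dt.strftime`, the Series counterpart of the `datetime.strftime` call that failed in the question; since `%B` month names can depend on the locale, comparing the month number is probably the safer test.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data (assumption: only the dates matter here).
df = pd.DataFrame({
    'date': pd.to_datetime(['2015-10-31'] * 4 + ['2015-11-01'] * 4),
})
df2 = pd.DataFrame(index=df.index)

# Vectorized: .dt.month yields an integer Series, so the comparison
# runs over the whole column at once instead of row by row.
df2['Check'] = np.where(df['date'].dt.month == 10, 0, 1)

# The same flag without NumPy: a boolean Series cast to integers.
check_alt = (df['date'].dt.month != 10).astype(int)

# Series-level strftime, for cases where the month name itself is needed.
month_names = df['date'].dt.strftime('%B')

print(df2['Check'].tolist())
```

The `.dt` accessor is what makes datetime attributes and methods available on a whole Series, which is why it avoids both the TypeError and the per-row Python loop.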