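I recently read a great article (https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4) showing that vectorization is much faster than iteration, and I would like to put that into practice. My current code takes about 16 hours to finish on roughly 2 million rows. Here is a sample of the data, which is stored in a Pandas DataFrame object named "data":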
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2274589 entries, 0 to 2274588
Data columns (total 5 columns):
 #   Column    Dtype
---  ------    -----
 0   Date      object
 1   Time      object
 2   Open      float64
 3   Close     float64
 4   datetime  datetime64[ns]
print(data)
         Date   Time      Open     Close            datetime
0  02/10/2012  07:26  191.9500  191.9500 2012-02-10 07:26:00
1  02/10/2012  07:56  191.6600  191.6600 2012-02-10 07:56:00
2  02/10/2012  08:00  191.9400  191.9400 2012-02-10 08:00:00
3  02/10/2012  09:30  191.7500  191.7500 2012-02-10 09:30:00
4  02/10/2012  09:54  191.8500  191.8500 2012-02-10 09:54:00
The working code drops rows with times before 9:30 AM or after 3:59 PM:
from datetime import datetime

keep = []
end = data.shape[0]
for row in data.itertuples(index=True):
    # 09:30 on the same calendar day as this row
    session_open = datetime(year=row.datetime.year, month=row.datetime.month,
                            day=row.datetime.day, hour=9, minute=30, second=0)
    if row.datetime < session_open:
        pass  # before 09:30: drop
    elif row.datetime.hour >= 16:  # closes at 15:59 (keep it!) in this database's notation
        pass  # 16:00 or later: drop
    else:
        keep.append(row[0])  # row[0] is the index label
    print(row[0], "/", end)  # progress indicator
data = data.loc[keep, :]
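Vectorization is new to me. I have tried a few things, but because the column is a Series rather than a single value, comparing against it or setting values does not work the way it does for one number. From what I have read, it seems I need to write a function so that I can do: data['keep_it'] = my_fun(data['datetime'])
Failed attempts: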
data['keep_it'] = my_fun(data['datetime'])
def my_fun(row): # returns 1 if desired to keep # a vectorized approach
    if (row < datetime.date(year = row.year, month = row.month, day = row.day, hour = 9, minute = 30, second = 0)):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'
    if (row < pd.to_datetime(str(row['datetime'].year) +'/' + str(row['datetime'].month) +'/' + str(row['datetime'].day) + 'T9:30:00')):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'
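The AttributeError above most likely arises because my_fun receives the whole Series at once, and a Series exposes datetime components through its `.dt` accessor rather than directly. A minimal sketch of a function that returns a boolean mask, assuming a small made-up sample in place of the real `data` (the sample rows and the `kept` name are illustrative only):

```python
import pandas as pd

# Hypothetical sample mirroring the question's 'datetime' column.
data = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
    ])
})

def my_fun(series):
    """Return a boolean Series: True where the time of day is within 09:30..15:59."""
    # .dt gives element-wise access to datetime parts of a datetime64 Series
    minutes = series.dt.hour * 60 + series.dt.minute  # minutes since midnight
    return (minutes >= 9 * 60 + 30) & (minutes < 16 * 60)

data["keep_it"] = my_fun(data["datetime"])
kept = data[data["keep_it"]]  # boolean indexing keeps only the True rows
```

Returning a boolean Series instead of 1 for kept rows means the result can be used directly as a row filter, with no Python-level loop.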
Any ideas? Thanks!
Answer 0: (score: 1)
This is vectorized.
import datetime as dt
import io

import pandas as pd

# Columns are separated by two or more spaces, so the single space inside
# each timestamp stays within one field.
df = pd.read_csv(io.StringIO("""Date  Time  Open  Close  datetime
02/10/2012  07:26  191.9500  191.9500  2012-02-10 07:26:00
02/10/2012  07:56  191.6600  191.6600  2012-02-10 07:56:00
02/10/2012  08:00  191.9400  191.9400  2012-02-10 08:00:00
02/10/2012  09:30  191.7500  191.7500  2012-02-10 09:30:00
02/10/2012  09:54  191.8500  191.8500  2012-02-10 09:54:00"""), sep=r"\s\s+", engine="python")
df["datetime"] = pd.to_datetime(df["datetime"])
# Keep rows whose time of day lies in 09:30..15:59 (both ends inclusive)
df.loc[df["datetime"].dt.time.between(dt.time(9, 30), dt.time(15, 59))]
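As a possible alternative to comparing `datetime.time` objects, pandas also offers `DataFrame.between_time`, which requires the timestamps to be the index. A minimal sketch, assuming made-up sample rows (the values and the `filtered` name are illustrative, not taken from the question's data):

```python
import pandas as pd

# Hypothetical sample; only the 'datetime' column matters for the filter.
df = pd.DataFrame({
    "Close": [191.95, 191.75, 191.85, 192.10],
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
    ]),
})

# between_time works on a DatetimeIndex, so move 'datetime' into the index,
# filter (both endpoints are included by default), then restore the column.
filtered = (
    df.set_index("datetime")
      .between_time("09:30", "15:59")
      .reset_index()
)
```

Note that after `reset_index()` the `datetime` column comes first and the original row labels are replaced by a fresh 0..n-1 index.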
Answer 1: (score: 1)
Thanks to @MrFuppes, I came up with this rough approach, which is essentially instantaneous:
testing = pd.DatetimeIndex(data['datetime'])
data = data[(testing.hour<16) & (testing.hour*60+testing.minute >= 9*60+30)]
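Room for improvement: do it in a single line so the intermediate `testing` variable is not needed, and possibly make proper use of the DatetimeIndex `.time` attribute.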
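One way the single-line version might look is with `DatetimeIndex.indexer_between_time`, which returns the integer positions of the rows whose time of day falls in a range. This is a sketch under the assumption that timestamps have minute resolution, as in the sample (a row at 15:59:30 would be dropped here but kept by the `hour<16` test); the sample rows are made up:

```python
import pandas as pd

# Hypothetical sample standing in for the question's 'data'.
data = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
    ])
})

# Single line, no intermediate variable: positions of rows between
# 09:30 and 15:59 (both ends included by default), selected with iloc.
data = data.iloc[pd.DatetimeIndex(data["datetime"]).indexer_between_time("09:30", "15:59")]
```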