I am new to pandas.
I have a very simple data frame named dlf with an index, two columns, and 40k rows. It is loaded like this:
d = pd.read_csv(csvsLocation + 'name.csv', index_col='ID')
d['LAST'] = pd.to_datetime(d['LAST'], format = '%d-%b-%y')
d['FIRST'] = pd.to_datetime(d['FIRST'], format = '%d-%b-%y')
dlf = d[['LAST', 'FIRST']]
It looks like this:
         LAST      FIRST
ID
1  1997-04-17 1991-10-04
3  2009-02-13 1988-07-07
5  2009-10-24 1995-12-06
6  1996-04-30 1989-03-14
Running this apply call takes 5 seconds:
year = 1997
dlf[str(year)] = dlf.apply(lambda row: 1*(year >= row['FIRST'].year and year <= row['LAST'].year), axis=1)
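I need to speed this up because I intend to run it hundreds of times.
I suspect the problem is the use of the lambda.
What am I doing wrong, and/or how can I speed this up?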
Answer 0 (score: 4)
You can access the year of both date columns through dt.year:
year = 1999
df[str(year)] = 1 * ((df['FIRST'].dt.year <= year) & (df['LAST'].dt.year >= year))
print(df)
Output:
         LAST      FIRST  1999
ID
1  1997-04-17 1991-10-14     0
3  2009-02-13 1988-07-07     1
5  2009-10-24 1995-10-06     1
6  1996-04-30 1969-03-14     0
You can also keep the boolean values as the result:
df[str(year)] = (df['FIRST'].dt.year <= year) & (df['LAST'].dt.year >= year)
print(df)
Output:
         LAST      FIRST   1999
ID
1  1997-04-17 1991-10-14  False
3  2009-02-13 1988-07-07   True
5  2009-10-24 1995-10-06   True
6  1996-04-30 1969-03-14  False
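For anyone who wants to try this without the CSV file, here is a minimal self-contained sketch of the same column-wise comparison. It assumes the four sample rows shown in the question; the frame is built by hand, so the construction code is my own and not part of the original setup.

```python
import pandas as pd

# Sample rows copied from the question (built by hand here, not loaded from the CSV)
df = pd.DataFrame(
    {
        "LAST": pd.to_datetime(["1997-04-17", "2009-02-13", "2009-10-24", "1996-04-30"]),
        "FIRST": pd.to_datetime(["1991-10-04", "1988-07-07", "1995-12-06", "1989-03-14"]),
    },
    index=pd.Index([1, 3, 5, 6], name="ID"),
)

year = 1999

# .dt.year extracts the year of every row at once, so both comparisons
# operate on whole columns instead of calling a lambda per row
active = (df["FIRST"].dt.year <= year) & (df["LAST"].dt.year >= year)

# Store 0/1 flags; assign `active` directly to keep booleans instead
df[str(year)] = active.astype(int)
print(df)
```

Note the parentheses around each comparison: `&` binds more tightly than `<=` and `>=`, so leaving them out changes the meaning of the expression.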
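Measuring performance is always interesting, but measuring can be tricky. If we only use the tiny 4-row example data frame, the vectorized version is actually a little slower: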
%timeit dlf[str(year)] = dlf.apply(lambda row: 1*(year >= row['FIRST'].year and year <= row['LAST'].year), axis=1)
1000 loops, best of 3: 1.27 ms per loop
%timeit df[str(year)] = 1 * ((df['FIRST'].dt.year <= year) & (df['LAST'].dt.year >= year))
100 loops, best of 3: 1.7 ms per loop
But now let's look at 40k rows:
big = pd.concat([df] * 10000)
>>> big.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 1 to 6
Data columns (total 4 columns):
LAST 40000 non-null datetime64[ns]
FIRST 40000 non-null datetime64[ns]
1999 40000 non-null bool
1997 40000 non-null int64
dtypes: bool(1), datetime64[ns](2), int64(1)
memory usage: 1.3 MB
Now we can see a significant speedup:
%timeit big[str(year)] = big.apply(lambda row: 1*(year >= row['FIRST'].year and year <= row['LAST'].year), axis=1)
1 loops, best of 3: 6.51 s per loop
%timeit big[str(year)] = 1 * ((big['FIRST'].dt.year <= year) & (big['LAST'].dt.year >= year))
100 loops, best of 3: 8.33 ms per loop
That is about 780 times faster (6.51 s versus 8.33 ms).
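Since %timeit only works inside IPython, here is a rough sketch of the same comparison as a plain script using time.perf_counter. The sample rows are the ones from the question, and n_copies is a name I made up for illustration; it is kept small here so the apply version finishes quickly, and setting it to 10000 should give the 40k rows used above. Timings will differ between machines and pandas versions, so treat the numbers as indicative only.

```python
import time

import pandas as pd

small = pd.DataFrame(
    {
        "LAST": pd.to_datetime(["1997-04-17", "2009-02-13", "2009-10-24", "1996-04-30"]),
        "FIRST": pd.to_datetime(["1991-10-04", "1988-07-07", "1995-12-06", "1989-03-14"]),
    },
    index=pd.Index([1, 3, 5, 6], name="ID"),
)

# n_copies is a hypothetical knob: 500 copies -> 2000 rows, 10000 copies -> 40k rows
n_copies = 500
big = pd.concat([small] * n_copies)
year = 1997


def with_apply(frame):
    # Row by row: one Python-level lambda call per row
    return frame.apply(
        lambda row: 1 * (row["FIRST"].year <= year <= row["LAST"].year), axis=1
    )


def vectorized(frame):
    # Column-wise: each comparison runs over a whole column at once
    return 1 * ((frame["FIRST"].dt.year <= year) & (frame["LAST"].dt.year >= year))


start = time.perf_counter()
slow = with_apply(big)
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = vectorized(big)
vectorized_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```

Both functions should return the same flags, so it is worth comparing slow and fast before trusting the timing.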
Answer 1 (score: 1)
I would precompute first_year and last_year to simplify the comparison:
dlf[str(year)] = (dlf['first_year'] <= year) & (dlf['last_year'] >= year)
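The answer does not show how first_year and last_year are built, so here is one possible sketch, assuming they are simply the years of the FIRST and LAST columns. The sample rows are the ones from the question, and the list of years in the loop is made up for illustration.

```python
import pandas as pd

dlf = pd.DataFrame(
    {
        "LAST": pd.to_datetime(["1997-04-17", "2009-02-13", "2009-10-24", "1996-04-30"]),
        "FIRST": pd.to_datetime(["1991-10-04", "1988-07-07", "1995-12-06", "1989-03-14"]),
    },
    index=pd.Index([1, 3, 5, 6], name="ID"),
)

# Extract the years once (assumed meaning of first_year / last_year)
dlf["first_year"] = dlf["FIRST"].dt.year
dlf["last_year"] = dlf["LAST"].dt.year

# Hypothetical list of years to flag; each pass reuses the precomputed columns
for year in [1989, 1997, 2000]:
    in_range = (dlf["first_year"] <= year) & (dlf["last_year"] >= year)
    dlf[str(year)] = in_range.astype(int)

print(dlf)
```

The point of precomputing is that the year extraction happens once, not once per year checked, which matters if the comparison is repeated hundreds of times.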
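Answer 2 (score: 1)
If I understood your question correctly, you are going to add multiple columns (one per year). Here is a general vectorized solution, so you don't need to repeat the operation 100 times: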
years = [1997, 2016, 2000, 1989]
years = sorted(years)
dfy = pd.DataFrame(pd.Series(years * len(df)).values.reshape(len(df),len(years)), columns=years)
df = df.join(dfy.apply(lambda x: x.between(df.FIRST.dt.year, df.LAST.dt.year)).astype(int))
df.columns = df.columns.astype(str)
Step by step:
In [160]: years = [1997, 2016, 2000, 1989]
In [161]: years = sorted(years)
In [162]: dfy = pd.DataFrame(pd.Series(years * len(df)).values.reshape(len(df),len(years)), columns=years)
In [163]: dfy
Out[163]:
   1989  1997  2000  2016
0  1989  1997  2000  2016
1  1989  1997  2000  2016
2  1989  1997  2000  2016
3  1989  1997  2000  2016
In [164]: dfy.apply(lambda x: x.between(df.FIRST.dt.year, df.LAST.dt.year)).astype(int)
Out[164]:
   1989  1997  2000  2016
0     0     1     0     0
1     1     1     1     0
2     0     1     1     0
3     1     0     0     0
In [165]: df = df.join(dfy.apply(lambda x: x.between(df.FIRST.dt.year, df.LAST.dt.year)).astype(int))
In [166]: df.columns = df.columns.astype(str)
In [167]: df
Out[167]:
       FIRST       LAST  1989  1997  2000  2016
0 1991-10-04 1997-04-17     0     1     0     0
1 1988-07-07 2009-02-13     1     1     1     0
2 1995-12-06 2009-10-24     0     1     1     0
3 1989-03-14 1996-04-30     1     0     0     0
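As a possible alternative to building the helper frame dfy, the same table of flags can be produced with NumPy broadcasting. This is a different technique from the one in the answer above, sketched here under the assumption that df has the default 0..3 index and the four sample rows shown; the variable names are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "FIRST": pd.to_datetime(["1991-10-04", "1988-07-07", "1995-12-06", "1989-03-14"]),
        "LAST": pd.to_datetime(["1997-04-17", "2009-02-13", "2009-10-24", "1996-04-30"]),
    }
)
years = sorted([1997, 2016, 2000, 1989])

# Column vectors of shape (n_rows, 1)
first = df["FIRST"].dt.year.to_numpy()[:, None]
last = df["LAST"].dt.year.to_numpy()[:, None]

# Row vector of shape (1, n_years)
years_row = np.array(years)[None, :]

# Broadcasting compares every row against every year: shape (n_rows, n_years)
flags = (first <= years_row) & (years_row <= last)

flags_df = pd.DataFrame(
    flags.astype(int), index=df.index, columns=[str(y) for y in years]
)
result = df.join(flags_df)
print(result)
```

This avoids the per-column lambda entirely, at the cost of holding an n_rows by n_years array in memory, which should be fine for 40k rows and a few hundred years.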