Question

I have a pandas DataFrame with roughly 250,000 rows x 6 columns. One of the columns is a date stored as text. I need to do three things:

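1. Convert the text to a date
2. Create a date with the same month and year as the converted date, but with the day always set to the 15th
3. Compute the date one month after the date computed above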

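I do all of this with apply statements. They work, but they seem slow to me: 7 seconds in total, whereas any SQL on the same machine would take a fraction of a second, even without parallelization. If this were a one-off I would not spend time speeding it up, but I have to do it many times on multiple DataFrames of similar size.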

Is there any way to speed up my code? Thanks a lot!

import datetime
import dateutil.relativedelta

# this takes 3.1 seconds
df['date_reformatted'] = df['date_raw'].apply(lambda r: datetime.datetime.strptime(r, "%d/%m/%Y") )

# this takes 0.8 seconds
df['date_15']= df['date_reformatted'].apply(lambda r: datetime.date( r.year, r.month,15 ) ) 

# this takes 3.3 seconds
df['date_next_month']= df['date_15'].apply(lambda x: x + dateutil.relativedelta.relativedelta(months=1) )

Answer 1

Yes, you can:

df['date_formatted'] = pd.to_datetime(df['date_raw'], format= "%d/%m/%Y")

The second part is a bit odd and I can't see how to vectorize it, but you can get both columns with a single loop:

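# d.month + 1 alone would raise ValueError for December, so roll over into January of the next year
pd.DataFrame([(datetime.date(d.year, d.month, 15),
               datetime.date(d.year + d.month // 12, d.month % 12 + 1, 15)) for d in df.date_formatted],
               columns=['date_15', 'date_next_month'])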

That might be a bit faster.
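
For what it's worth, steps 2 and 3 can probably be done without a Python-level loop. The following is only a sketch and is not part of the original answer: it assumes the column has already been parsed with pd.to_datetime, uses a small made-up sample frame, and leans on pd.to_timedelta and pd.DateOffset for the date arithmetic.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's 'date_raw' column.
df = pd.DataFrame({'date_raw': ['31/12/2000', '01/02/2001', '15/07/1999']})
df['date_formatted'] = pd.to_datetime(df['date_raw'], format='%d/%m/%Y')

# Shift each date by (15 - day) days so it lands on the 15th of its own month.
day_shift = pd.to_timedelta(15 - df['date_formatted'].dt.day, unit='D')
df['date_15'] = df['date_formatted'] + day_shift

# A one-month calendar offset rolls December over into January of the next year.
df['date_next_month'] = df['date_15'] + pd.DateOffset(months=1)

print(df)
```

Because the day is fixed at 15, adding one calendar month never runs into end-of-month clipping.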

Answer 2

Try working with integers and strings. Convert to datetime objects only if you really need them.

%%timeit -n10  df = pd.DataFrame({'date_raw': ['31/12/2000']*250000})
_, months, years = zip(*df.date_raw.str.split('/'))
months_years = [(1 if m == '12' else int(m) + 1, 
                 int(y) + 1 if m == '12' else int(y)) 
                for m, y in zip(months, years)]
# New dates in dd-mm-yyyy format:
df['new_date'] = ['15-{0}-{1}'.format(x[0], x[1]) for x in months_years]

10 loops, best of 3: 583 ms per loop

>>> df.tail()
          date_raw   new_date
249995  31/12/2000  15-1-2001
249996  31/12/2000  15-1-2001
249997  31/12/2000  15-1-2001
249998  31/12/2000  15-1-2001
249999  31/12/2000  15-1-2001

The new dates are in text form (which is why this is fast). Creating datetime objects is somewhat time-consuming, but if you really need them:

%%timeit
df['new_date'].apply(lambda r: datetime.datetime.strptime(r, "%d-%m-%Y") )

1 loops, best of 3: 2.72 s per loop
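
If datetime objects are needed after all, the final parsing step can likely be handed to pd.to_datetime with an explicit format rather than apply plus strptime. This is a sketch under the assumption that the strings look like the new_date values above (day-month-year, month not zero-padded). The tiny frame is made up, and no timing is claimed for it.

```python
import pandas as pd

# Hypothetical sample in the same 'dd-m-yyyy' text form as the 'new_date' column.
df = pd.DataFrame({'new_date': ['15-1-2001', '15-12-2000', '15-7-1999']})

# Parse the whole column in one call instead of once per row.
df['new_date_parsed'] = pd.to_datetime(df['new_date'], format='%d-%m-%Y')

print(df.dtypes)
```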

Answer 3

In [51]: df = pd.DataFrame({'date_raw': pd.to_datetime(['2000-12-31']*250000)}) 

In [66]: %timeit pd.DataFrame({'date_raw': pd.to_datetime(['2000-12-31']*250000)})
10 loops, best of 3: 47.4 ms per loop

In [52]: df       
Out[52]: 
         date_raw
0      2000-12-31
1      2000-12-31
2      2000-12-31
3      2000-12-31
4      2000-12-31
5      2000-12-31
...           ...
249994 2000-12-31
249995 2000-12-31
249996 2000-12-31
249997 2000-12-31
249998 2000-12-31
249999 2000-12-31

[250000 rows x 1 columns]

In [53]: df['date'] = pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('D') + pd.Timedelta('14d')

In [54]: df
Out[54]: 
         date_raw       date
0      2000-12-31 2000-12-15
1      2000-12-31 2000-12-15
2      2000-12-31 2000-12-15
3      2000-12-31 2000-12-15
4      2000-12-31 2000-12-15
5      2000-12-31 2000-12-15
...           ...        ...
249994 2000-12-31 2000-12-15
249995 2000-12-31 2000-12-15
249996 2000-12-31 2000-12-15
249997 2000-12-31 2000-12-15
249998 2000-12-31 2000-12-15
249999 2000-12-31 2000-12-15

[250000 rows x 2 columns]

Timings

In [55]: %timeit pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('D') + pd.Timedelta('14d')
10 loops, best of 3: 62.1 ms per loop

Once the PR is merged, this will be more compact. In other words: pd.DatetimeIndex(df.date_raw).to_period('M').to_timestamp('15D')

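Your third task (the date one month later) is easy if you convert to a period again and then add 1 at the same frequency, which is monthly in this case. This is also vectorized.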

In [80]: df['date_plus_1'] = (pd.DatetimeIndex(df.date).to_period('M') + 1).to_timestamp('D') + pd.Timedelta('14d')

In [81]: df
Out[81]: 
         date_raw       date date_plus_1
0      2000-12-31 2000-12-15  2001-01-15
1      2000-12-31 2000-12-15  2001-01-15
2      2000-12-31 2000-12-15  2001-01-15
3      2000-12-31 2000-12-15  2001-01-15
4      2000-12-31 2000-12-15  2001-01-15
5      2000-12-31 2000-12-15  2001-01-15
...           ...        ...         ...
249994 2000-12-31 2000-12-15  2001-01-15
249995 2000-12-31 2000-12-15  2001-01-15
249996 2000-12-31 2000-12-15  2001-01-15
249997 2000-12-31 2000-12-15  2001-01-15
249998 2000-12-31 2000-12-15  2001-01-15
249999 2000-12-31 2000-12-15  2001-01-15

[250000 rows x 3 columns]

In [82]: %timeit (pd.DatetimeIndex(df.date).to_period('M') + 1).to_timestamp('D') + pd.Timedelta('14d')
10 loops, best of 3: 56.7 ms per loop
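
Putting the pieces of this answer together, the three steps from the question could be wrapped roughly as follows. This is a sketch, not the answerer's code: the helper name add_mid_month_columns and its parameters are hypothetical, and it assumes the raw column holds day/month/year text as in the question.

```python
import pandas as pd

def add_mid_month_columns(df, raw_col='date_raw', fmt='%d/%m/%Y'):
    """Hypothetical helper: return a copy of df with three derived date columns."""
    out = df.copy()
    parsed = pd.to_datetime(out[raw_col], format=fmt)   # step 1: text -> datetime
    months = parsed.dt.to_period('M')                   # keep only year and month
    offset = pd.Timedelta(days=14)                      # 1st of the month + 14 days = 15th
    out['date_reformatted'] = parsed
    out['date_15'] = months.dt.to_timestamp() + offset                 # step 2
    out['date_next_month'] = (months + 1).dt.to_timestamp() + offset   # step 3
    return out

# Made-up sample rows, including a December date to exercise the year rollover.
sample = pd.DataFrame({'date_raw': ['31/12/2000', '01/02/2001']})
result = add_mid_month_columns(sample)
print(result)
```

Adding 1 to a monthly period moves it forward by one month, so the December row rolls into January of the following year without any special casing.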

Python pandas: can I speed up this apply statement?
