Suppose I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({'x': [1,2,3], 'f': [lambda x: x + 1,
lambda x: x ** 2,
lambda x: x / 5]})
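I want to apply each 'f' to the corresponding 'x' and store the result in a new column 'y'. I currently do this with apply, but it is rather slow. Is there a better way? Is storing lambdas in DataFrames a bad idea?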
df['y'] = df.apply(lambda row: row['f'](row['x']), axis=1)
Answer 0 (score: 1)
Is storing lambdas in DataFrames a bad idea?
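I think so, because pandas works well only with scalar values, and a column of lambdas is just a column of generic Python objects.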
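As a rough illustration of that point (a sketch added for clarity, not part of the timings in this answer), you can inspect the column dtypes. The names x_dtype and f_dtype are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3],
                   'f': [lambda x: x + 1,
                         lambda x: x ** 2,
                         lambda x: x / 5]})

# 'x' is stored as a numeric column, while 'f' holds arbitrary
# Python objects, so pandas falls back to the generic object dtype
x_dtype = df['x'].dtype
f_dtype = df['f'].dtype
print(x_dtype, f_dtype)
```

With an object column, every call goes through the Python interpreter one element at a time, so nothing here can be vectorized.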
Looping in a list comprehension is faster:
df = pd.DataFrame({'x': [1,2,3], 'f': [lambda x: x + 1,
lambda x: x ** 2,
lambda x: x / 5]})
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [97]: %timeit df['y'] = df.apply(lambda row: row['f'](row['x']), axis=1)
104 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [98]: %timeit df['y1'] = [f(x) for f, x in zip(df['f'], df['x'])]
3 ms ± 93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#300k rows (starting again from the original 3-row df)
df = pd.concat([df] * 100000, ignore_index=True)
In [102]: %timeit df['y'] = df.apply(lambda row: row['f'](row['x']), axis=1)
10.3 s ± 315 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [103]: %timeit df['y1'] = [f(x) for f, x in zip(df['f'], df['x'])]
318 ms ± 4.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)