Why is creating a new column slow on a Pandas DataFrame with an unsorted index

Time: 2019-03-01 17:21:36

Tags: python pandas dataframe

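My goal is to do some basic calculations on the first rows (one per unique value of a column) and assign the result to a new column in the dataframe.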

A simple example:

import numpy as np
import pandas as pd

df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})

# drop duplicates 
first = df.drop_duplicates(subset='A', keep='first').copy()
%timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()

This gives:

532 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

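If I reset the index, it is almost 2x faster (in case the difference was due to some caching, I reran both several times in different orders, with the same result):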

# drop duplicates but reset index
first = df.drop_duplicates(subset='A', keep='first').reset_index(drop=True).copy()
%timeit  first['H'] = first['A']*first['B'] + first['C']

342 µs ± 7.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Although this is not a big difference, I would like to know what causes it. Thanks.
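Note that the second timing above drops the `- first['D'].max()` term, so the two measurements do not time the same expression. A like-for-like comparison could be sketched as below. This is only a sketch under assumptions: it uses the standard-library `timeit` module instead of the `%timeit` magic, a fixed random seed, and a hypothetical helper `add_h` that is not part of the original post. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Fixed seed so reruns use the same data (an assumption, not in the original post).
rng = np.random.default_rng(0)
df = pd.DataFrame({k: rng.integers(0, 1000, 100) for k in "ABCDEFG"})


def add_h(frame):
    # Hypothetical helper: the same expression is used for both variants.
    frame["H"] = frame["A"] * frame["B"] + frame["C"] - frame["D"].max()


# Variant 1: keep the gapped index left behind by drop_duplicates.
kept = df.drop_duplicates(subset="A", keep="first").copy()
# Variant 2: same rows, but with a fresh 0..n-1 index.
reset = kept.reset_index(drop=True)

for name, frame in [("original index", kept), ("reset index", reset)]:
    # Best of 5 repeats, 200 calls each, reported per call.
    best = min(timeit.repeat(lambda: add_h(frame), number=200, repeat=5)) / 200
    print(f"{name}: {best * 1e6:.0f} microseconds per call")
```

If the two lines printed are close to each other, the gap in the original measurements would be explained by the missing term rather than by the index.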

Update

I redid this simple test. The problem is not related to the index; it seems to be related to whether the dataframe is a copy:

In [1]: import pandas as pd
In [2]: import numpy as np

In [3]: df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})

In [4]: # drop duplicates
   ...: first = df.drop_duplicates(subset='A', keep='first').copy()
   ...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
558 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: # drop duplicates
   ...: first = df.drop_duplicates(subset='A', keep='first')
   ...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
/Users/sam/anaconda3/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/Users/sam_dessa/anaconda3/bin/python
20.7 ms ± 826 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

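Making a copy and then assigning a new column takes about 532 µs, but operating directly on the dataframe itself (where pandas also issues a warning) takes 20.7 ms. Same question as before: what causes this? Is it simply the time spent issuing the warning?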
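One way to probe the last question is to repeat the measurement with warning output silenced. The sketch below makes several assumptions that are not in the original post: it uses the standard-library `timeit` and `warnings` modules, a fixed seed, and hypothetical helpers `add_h` and `per_call`. On recent pandas versions the `SettingWithCopyWarning` may not be emitted at all, in which case both timings would be expected to look alike.

```python
import timeit
import warnings

import numpy as np
import pandas as pd

# Fixed seed so reruns use the same data (an assumption, not in the original post).
rng = np.random.default_rng(0)
df = pd.DataFrame({k: rng.integers(0, 1000, 100) for k in "ABCDEFG"})


def add_h(frame):
    # Hypothetical helper: the assignment being timed.
    frame["H"] = frame["A"] * frame["B"] + frame["C"] - frame["D"].max()


def per_call(frame, number=20):
    # Silence all warnings while timing, so that printing a warning
    # cannot contribute to the measured time.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        runs = timeit.repeat(lambda: add_h(frame), number=number, repeat=5)
    return min(runs) / number


no_copy = df.drop_duplicates(subset="A", keep="first")
with_copy = df.drop_duplicates(subset="A", keep="first").copy()

timings = {
    "without .copy()": per_call(no_copy),
    "with .copy()": per_call(with_copy),
}
for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e6:.0f} microseconds per call")
```

If the frame without `.copy()` is still much slower with warnings silenced, that would suggest the time goes into whatever pandas does to decide that a warning is needed, not into displaying the warning.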

0 answers:

No answers