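My goal is to run some basic calculations on the first rows (those kept after dropping duplicates) and assign the result to a new column in the DataFrame.
A simple example: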
import numpy as np
import pandas as pd
df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})
# drop duplicates
first = df.drop_duplicates(subset='A', keep='first').copy()
%timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
This gives:
532 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
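If I reset the index, it is almost 2x faster. (In case the difference was due to some caching, I re-ran both versions several times in different orders, and the results were the same.)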
# drop duplicates but reset index
first = df.drop_duplicates(subset='A', keep='first').reset_index(drop=True).copy()
%timeit first['H'] = first['A']*first['B'] + first['C']
342 µs ± 7.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Although this is not a big difference, I would like to know what causes it. Thanks.
Update:
I redid this simple test. The issue is not related to the index; it seems to be related to the copy of the DataFrame:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({k: np.random.randint(0, 1000, 100) for k in list('ABCDEFG')})
In [4]: # drop duplicates
...: first = df.drop_duplicates(subset='A', keep='first').copy()
...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
558 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: # drop duplicates
...: first = df.drop_duplicates(subset='A', keep='first')
...: %timeit first['H'] = first['A']*first['B'] + first['C'] - first['D'].max()
/Users/sam/anaconda3/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/Users/sam_dessa/anaconda3/bin/python
20.7 ms ± 826 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
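Making a copy and assigning a new column takes about 532 µs (558 µs in the run above), but operating directly on the un-copied result (where pandas also emits a warning) takes 20.7 ms. So, the same question as before: what causes this? Is it just the time spent emitting the warning?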
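One way to check whether printing the warning is what costs the time is to rerun both cases outside IPython with the warning message silenced. The following is only a sketch under some assumptions: the helper name `assign`, the fixed seed, and the `number`/`repeat` values are illustrative choices, not part of the original test. The warning filter only hides the message, so any work pandas does before deciding to warn is still included in the timing, and the numbers will vary with the pandas version.

```python
import timeit
import warnings

import numpy as np
import pandas as pd

# Same shape of data as in the question; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
df = pd.DataFrame({k: rng.integers(0, 1000, 100) for k in "ABCDEFG"})


def assign(frame):
    """Hypothetical helper: the column assignment timed in the question."""
    frame["H"] = frame["A"] * frame["B"] + frame["C"] - frame["D"].max()


with_copy = df.drop_duplicates(subset="A", keep="first").copy()
without_copy = df.drop_duplicates(subset="A", keep="first")

n = 20  # loops per measurement, kept small because the slow case may take ~20 ms
with warnings.catch_warnings():
    # Hide the warning text only; pandas' own checks still run and are timed.
    warnings.simplefilter("ignore")
    t_copy = min(timeit.repeat(lambda: assign(with_copy), number=n, repeat=3)) / n
    t_nocopy = min(timeit.repeat(lambda: assign(without_copy), number=n, repeat=3)) / n

print(f"with .copy():    {t_copy * 1e6:.0f} us per loop")
print(f"without .copy(): {t_nocopy * 1e6:.0f} us per loop")
```

If the un-copied case is still much slower with the message hidden, then the time is not being spent on printing the warning itself.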