Question

Try it yourself:

import pandas as pd
s=pd.Series(xrange(5000000))
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow
1 loops, best of 3: 445 ms per loop
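Outside IPython the `%timeit` magic is not available. A minimal sketch of the same comparison with the standard-library `timeit` module follows; the helper name `best_of` and the smaller series size are assumptions made for illustration, and the numbers will depend on the pandas version installed.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 5,000,000 rows in the snippet above, to keep the run short.
s = pd.Series(np.arange(200000))


def best_of(func, repeat=3, number=10):
    """Best per-call time in seconds (hypothetical helper, similar in spirit to %timeit)."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number


t_scalar = best_of(lambda: s.loc[0])    # scalar key
t_list = best_of(lambda: s.loc[[0]])    # single-element list key

print("scalar key: %.1f us, list key: %.1f us" % (t_scalar * 1e6, t_list * 1e6))
```

On a pandas version affected by the bug, the list-key timing should stand out by orders of magnitude; on other versions the gap is expected to be much smaller.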

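UPDATE: this is a legitimate bug in pandas that was probably introduced in 0.15.1 around August 2014. Workarounds: wait for a new release while using an older version of pandas; get a cutting-edge dev version from github; manually make a one-line modification in your installation of pandas; temporarily use .ix instead of .loc.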

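I have a DataFrame with 4.8 million rows, and selecting a single row using .loc[[ id ]] (with a single-element list) takes 489 ms, almost half a second. That is 1,800 times slower than the identical .ix[[ id ]], and 3,500 times slower than .loc[id] (passing id as a value, not as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is more than a thousand times faster and produces an identical result. It was my understanding that .ix was supposed to be slower, wasn't it?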
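To make the difference between the two key styles concrete, here is a small sketch. The tiny frame and the column name `value` are made up for illustration; only `.loc` is shown, since `.ix` is not available in current pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large frame: sorted integer labels starting at 1.
df = pd.DataFrame({"value": np.arange(10) * 10}, index=np.arange(1, 11))

row_id = 7  # a label, not a position

as_series = df.loc[row_id]    # scalar key: the row comes back as a Series
as_frame = df.loc[[row_id]]   # list key: a one-row DataFrame

# Both forms select the same data; only the container differs.
same_values = as_series["value"] == as_frame["value"].iloc[0]
```

So when only one row is needed and a Series is acceptable, the scalar form avoids the list-key code path altogether.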

I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somehow more general, and presumably slower, than .loc and .iloc. Specifically, it says

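However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.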

Here is an IPython session with benchmarks:

print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
print 'df.index begins with ', df.index[:20]
print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

# First extract one element directly. Expected result, no issues here.
id=5965356
print 'Extract one element with id %d' % id
%timeit df.loc[id]
%timeit df.ix[id]
print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

# Now extract this one element as a list.
%timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
%timeit df.ix[[id]]
print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]])) # this one should be True

# Let's double-check that in this case .ix is the same as .loc, not .iloc,
# as this would explain the difference.
try:
    print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
except:
    print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

# Finally, for the sake of completeness, let's take a look at iloc
%timeit df.iloc[3456789] # this is still 100+ times faster than the next version
%timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with  Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop

Answer 1

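It looks like the issue was not present in pandas 0.14. I profiled it with line_profiler and I think I know what happened. As of pandas 0.15.1, a KeyError is raised if a given index label is not present. It looks like when you use the .loc[list] syntax, pandas does an exhaustive search for the label along the whole axis, even after it has been found. That is, first, there is no early termination once an element is found, and second, the search in this case is brute force.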

File: .../anaconda/lib/python2.7/site-packages/pandas/core/indexing.py

  1278                                                       # require at least 1 element in the index
  1279         1          241    241.0      0.1              idx = _ensure_index(key)
  1280         1       391040 391040.0     99.9              if len(idx) and not idx.isin(ax).any():
  1281                                           
  1282                                                           raise KeyError("None of [%s] are in the [%s]" %
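The profile attributes 99.9% of the time to the `idx.isin(ax).any()` check on line 1280, where `ax` is the full axis. As a rough illustration of why that is expensive, the sketch below compares that style of check with asking the axis itself to locate the labels through `Index.get_indexer`. This is only a sketch under assumptions (unique labels, a made-up axis of one million entries); it is not claimed to be the fix that pandas adopted.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the large axis: unique, sorted integer labels.
ax = pd.Index(np.arange(1, 1000001))
key = pd.Index([5, 999999])  # the labels being requested

# Style of check seen in the profile: every call has to process all of ax.
found_scan = key.isin(ax).any()

# Alternative: let the axis look the labels up; -1 marks a missing label.
positions = ax.get_indexer(key)
found_lookup = (positions != -1).any()

# A label that is absent from ax is reported as -1.
missing = ax.get_indexer(pd.Index([0]))
```

Both checks answer the same question. The difference is that the lookup goes through the index's own lookup machinery, which pandas can reuse across calls, rather than rescanning the whole axis for a single label.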

Answer 2

Pandas indexing is crazy slow, so I switched to numpy indexing:

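import numpy as np
import pandas as pd

df=pd.DataFrame(some_content)  # some_content: placeholder for your own data
# takes forever!!
for iPer in np.arange(-df.shape[0],0,1):
    x = df.iloc[iPer,:].values
    y = df.iloc[-1,:].values
# fast!
vals = np.matrix(df.values)  # convert once, then index the matrix directly
for iPer in np.arange(-vals.shape[0],0,1):
    x = vals[iPer,:]
    y = vals[-1,:]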
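A self-contained variant of the same idea is sketched below. It assumes a recent pandas where `DataFrame.to_numpy()` is available and uses a plain ndarray instead of `np.matrix`; the small numeric frame is made up to stand in for `some_content`.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for some_content.
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])

# Convert once up front, then index the array inside the loop.
vals = df.to_numpy()

rows_via_iloc = [df.iloc[i, :].values for i in range(-len(df), 0)]
rows_via_numpy = [vals[i, :] for i in range(-vals.shape[0], 0)]
```

Both loops visit the same rows in the same order. Note that a frame with mixed column types converts to an object-dtype array, which gives up much of the speed advantage.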

Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?
