Question

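I am using Pandas 0.12.0. Say multi_df is a Pandas DataFrame with a MultiIndex. I have a (long) list of tuples (multi-index keys) named look_up_list. For each tuple in look_up_list that is present in multi_df, I want to perform an operation.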

Below is my code. Is there a faster way to achieve this? In practice, len(multi_df) and len(look_up_list) are both quite large, so I need to optimize this line:

[multi_df.ix[idx]**2 for idx in look_up_list if idx in multi_df.index]

In particular, line_profiler tells me that the conditional check

if idx in multi_df.index

takes a long time.

import pandas as pd
df = pd.DataFrame({'id' : range(1,9),
                   'code' : ['one', 'one', 'two', 'three',
                             'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white','white','white',
                              'black', 'black', 'white', 'white'],
                   'texture': ['soft', 'soft', 'hard','soft','hard',
                               'hard','hard','hard'],
                   'shape': ['round', 'triangular', 'triangular','triangular','square',
                             'triangular','round','triangular']
                   }, columns= ['id','code','colour', 'texture', 'shape'])
multi_df = df.set_index(['code','colour','texture','shape']).sort_index()['id']

# define the list of indices that I want to look up for in multi_df
look_up_list = [('two', 'white', 'hard', 'triangular'),('five', 'black', 'hard', 'square'),('four', 'black', 'hard', 'round') ]

# run a list comprehension
[multi_df.ix[idx]**2 for idx in look_up_list if idx in multi_df.index]

P.S.: The actual operation in the list comprehension is not multi_df.ix[idx]**2 itself, but something similar.

Answer 1

Perhaps use multi_df.loc[look_up_list].dropna().

import pandas as pd
df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

# define the list of index tuples to look up in multi_df
look_up_list = [('two', 'white', 'hard', 'triangular'), (
    'five', 'black', 'hard', 'square'), ('four', 'black', 'hard', 'round')]

subdf = multi_df.loc[look_up_list].dropna()
print(subdf ** 2)

yields

(two, white, hard, triangular)     9
(two, white, hard, triangular)    64
Name: id, dtype: float64

Notes:

multi_df is a Series, not a DataFrame. I don't think that affects the solution.
The code you posted above raises IndexingError: Too many indexers, so I am guessing (a little) at the intent of the code.
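
On recent pandas releases the approach above may need adjusting: .ix has been removed, and .loc raises a KeyError when the list contains labels that are missing from the index. A minimal sketch of the same idea, assuming a current pandas version, is to run one vectorized membership test over the whole index instead of one `idx in multi_df.index` check per tuple. The names mask, found and squared are illustrative only and do not come from the original post.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']},
    columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

look_up_list = [('two', 'white', 'hard', 'triangular'),
                ('five', 'black', 'hard', 'square'),
                ('four', 'black', 'hard', 'round')]

# Boolean array: True for every row whose full index tuple is in look_up_list.
# Tuples that do not occur in the index simply match nothing.
mask = multi_df.index.isin(look_up_list)

# Keep only the matching rows, then apply the (stand-in) operation once.
found = multi_df[mask]
squared = found ** 2
print(squared)
```

If the real operation cannot be vectorized, iterating over found still pays for the membership check only once rather than once per tuple in look_up_list.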

Optimizing Pandas multi-index lookup
