Question

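I am trying to find the top n values in a large DataFrame. The keys are combinations of similarly named objects in my first two columns. However, I want to find the largest values regardless of which column the key appears in. This is better demonstrated by example: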

import itertools
import numpy as np
import pandas as pd

np.random.seed(10)

pairs = [combo for combo in itertools.combinations(['apple','banana','pear','orange'], 2)]

df = pd.DataFrame(pairs, columns=['a','b'])
df['score'] = np.random.rand(6)

Original DataFrame:

In [2]: df
Out[2]:
        a       b     score
0   apple  banana  0.771321
1   apple    pear  0.020752
2   apple  orange  0.633648
3  banana    pear  0.748804
4  banana  orange  0.498507
5    pear  orange  0.224797

Here is how I would accomplish the task using SQL, assuming I have a database table named fruits that mirrors the df above:

uniq = pd.unique(df[['a', 'b']].values.ravel())

# engine is assumed to be an existing database connection (its setup is not shown here)
frames = []
for fruit in uniq:
    dfsql_tmp = pd.read_sql_query(
        """SELECT a, b, score FROM fruits
        WHERE a = %s
        OR b = %s
        ORDER BY score DESC
        LIMIT 1;""",
        engine, params=[fruit, fruit])
    frames.append(dfsql_tmp)

df_sql = pd.concat(frames, ignore_index=True)

This gives exactly what I am asking for: the top n scores for each unique value (taken from the union of df['a'] and df['b']). Desired output:

In [5]: df_sql
Out[5]:
        a       b     score
0   apple  banana  0.771321  # highest apple score
1   apple  banana  0.771321  # highest banana score
2   apple  orange  0.633648  # highest orange score
3  banana    pear  0.748804  # highest pear score

Edit

This also does the trick, but it scales poorly:

N = 1
df_new = pd.DataFrame()
for fruit in uniq:
    df_tmp = df[(df['a'] == fruit) | (df['b'] == fruit)].sort_values('score', ascending=False).head(N)
    df_new = pd.concat([df_new, df_tmp])

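Is there a better way to get the result I want? The looped SQL queries do not scale. I would rather perform the operation on one large df. Keeping the top n is also important, not just the max or min.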

Answer 1

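This is not a pretty solution, and I suspect there is a better one out there, but here is one that works. It creates a ~550k-row x 5-column DataFrame and runs in about 4 seconds on my laptop.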

import itertools
import string
from functools import reduce

import numpy as np
import pandas as pd

np.random.seed(10)
# string.letters is Python 2 only; string.ascii_letters is the Python 3 equivalent
pairs = list(itertools.combinations(string.ascii_letters + string.digits, 4))

df = pd.DataFrame(pairs, columns=['a', 'b', 'c', 'd'])
df['score'] = np.random.rand(len(df))

cols = ['a', 'b', 'c', 'd']
indexes = []

for c in pd.concat([df[col] for col in cols]).unique():
    # rows where c appears in any key column; keep the index of the best score
    mask = reduce(lambda x, y: x | y, [df[col] == c for col in cols])
    indexes.append(df[mask]['score'].idxmax())
print(df.loc[indexes])

If you don't want to keep the original index in the output, add .reset_index() at the end.

For the top N, instead of calling .idxmax(), sort the reduced frame and use .iloc[:N] to get the top N indexes.
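As a rough sketch of that top-N variant, the loop might look something like the following. It uses the small fruit frame from the question rather than the 550k-row one; the helper name top_n_per_key and the extra key column are illustrative assumptions, not part of the original answer.

```python
from functools import reduce
import itertools

import numpy as np
import pandas as pd


def top_n_per_key(df, cols, n, score_col="score"):
    """Return the n highest-scoring rows for each key found in any of cols.

    Hypothetical helper: the name and the added 'key' column are assumptions
    made for illustration only.
    """
    pieces = []
    for key in pd.unique(df[cols].values.ravel()):
        # Rows where the key appears in at least one of the key columns.
        mask = reduce(lambda x, y: x | y, [df[col] == key for col in cols])
        # Sort the reduced frame by score and keep the first n rows.
        top = df[mask].sort_values(score_col, ascending=False).iloc[:n]
        # Record which key each row was selected for.
        pieces.append(top.assign(key=key))
    return pd.concat(pieces)


np.random.seed(10)
pairs = list(itertools.combinations(["apple", "banana", "pear", "orange"], 2))
df = pd.DataFrame(pairs, columns=["a", "b"])
df["score"] = np.random.rand(len(df))

result = top_n_per_key(df, ["a", "b"], n=2)
print(result)
```

The same row can show up more than once, since a high-scoring pair may be in the top N for both of its keys; the key column makes it clear which key each copy belongs to. This still loops once per unique key, so it shares the scaling behaviour of the idxmax version.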
