Question

I am working with a pandas DataFrame and iterating over all possible combinations of values in several columns. I do this with itertools.product and pandas.Series.unique():

query_fields = ['direction','subj_id','speed']
query_items = [df_reps[k].unique() for k in query_fields]

for a in itertools.product(*query_items):
    df = df_reps[(df_reps['direction']==a[0]) & (df_reps['subj_id']==a[1]) & (df_reps['speed']==a[2])]
    #Do something with df

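I would like to know whether there is a more Pythonic way to condense my DataFrame query. With more query fields, this approach becomes increasingly unwieldy. One possibility is to iterate over all the fields and apply each condition separately (like a funnel). That could be done with a list comprehension, for example: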

df = df_reps[(df_reps[qf]==a[i]) for qf,i in enumerate(query_fields)] #Doesn't work

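Does this functionality already exist in pandas?

Edit

Input: a DataFrame and a list of column headers.

Output: a loop or similar that selects each combination of unique values in the columns specified by the list of headers.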

Answer 1

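The problem of "iterating over all possible combinations of values in several columns" can be solved easily with pandas groupby. Essentially, you create groups based on the values of all the columns and then retrieve the part of the data where each combination occurs. No loop is involved, and it is a one-liner.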

import pandas as pd
import numpy as np
import itertools

df = pd.DataFrame(np.random.randint(1,4, (100, 5)),
                  columns = ['direction','subj_id','speed','other1', 'other2'])

fields = ['direction','subj_id','speed']  

grouped_by_values = df.groupby(fields)
queries_results = {key: group for key, group in grouped_by_values }

Here is a sample of the results:

for key, group in queries_results.items():
#for key, group in grouped_by_values:  #Equivalent, probably better
    print(key, group)


(1, 1, 1)     direction  subj_id  speed  other1  other2
3           1        1      1       3       3
37          1        1      1       2       3
48          1        1      1       2       1
52          1        1      1       1       3
81          1        1      1       1       1
97          1        1      1       1       1
(1, 1, 2)     direction  subj_id  speed  other1  other2
25          1        1      2       2       3
62          1        1      2       3       1
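
As a rough check of how the groupby approach relates to the mask-based loop in the question, the sketch below runs both on a small hand-made frame. The frame `df_small` and its values are made up for illustration and are not from the original post. One difference worth noting: groupby only yields combinations that actually occur in the data, whereas itertools.product also yields combinations that match no rows.

```python
import itertools
import pandas as pd

# Hypothetical toy data, not from the original post.
df_small = pd.DataFrame({
    'direction': [1, 1, 2, 2, 2],
    'subj_id':   [1, 1, 1, 2, 2],
    'speed':     [1, 2, 1, 1, 1],
    'other1':    [10, 20, 30, 40, 50],
})
fields = ['direction', 'subj_id', 'speed']

# groupby: one group per combination that is present in the data.
groups = {key: group for key, group in df_small.groupby(fields)}

# product + masks: every combination of unique values, including empty ones.
unique_values = [df_small[f].unique() for f in fields]
non_empty = {}
n_combinations = 0
for combo in itertools.product(*unique_values):
    n_combinations += 1
    mask = pd.Series(True, index=df_small.index)
    for field, value in zip(fields, combo):
        mask &= df_small[field] == value
    subset = df_small[mask]
    if not subset.empty:
        non_empty[tuple(int(v) for v in combo)] = subset

# Total combinations tried, groups found by groupby, non-empty mask results.
print(n_combinations, len(groups), len(non_empty))
```

If empty combinations matter for your analysis, the product-based loop is still the one that visits them; groupby simply skips them.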

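If you want to know how to "condense the DataFrame query", here is one way: build a list of boolean masks (one per condition) and then use reduce to take their intersection.

Here is an example: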

from functools import reduce  # reduce is not a builtin in Python 3

import pandas as pd
import numpy as np

# Reproducible Example
df = pd.DataFrame(np.random.randint(1,4, (100, 3)), columns = ['A', 'B', 'C'])
query_fields = ['A','B','C']
query_items = [1,2,3]

# Individual masks
ind_masks = [df[key].eq(val) for key, val in zip(query_fields, query_items)]
# Combined Query
mask = reduce(lambda x, y: x & y, ind_masks)

query_result = df[mask]
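
The mask-and-reduce idea above can be wrapped in a small helper so that the number of query fields no longer matters. This is only a sketch: the function name `select_rows`, the dict-based interface, and the frame `df_demo` are assumptions made for illustration, not part of pandas or of the original answer.

```python
from functools import reduce
import operator

import pandas as pd


def select_rows(df, conditions):
    """Return the rows of df where every column in `conditions` equals its value.

    `conditions` is a dict mapping column name -> required value (hypothetical interface).
    """
    if not conditions:
        # No conditions: nothing to filter on.
        return df
    masks = [df[col].eq(val) for col, val in conditions.items()]
    # Element-wise AND of all the individual masks.
    return df[reduce(operator.and_, masks)]


# Made-up example data.
df_demo = pd.DataFrame({'A': [1, 1, 2, 1], 'B': [2, 2, 2, 3], 'C': [3, 0, 3, 3]})
result = select_rows(df_demo, {'A': 1, 'B': 2, 'C': 3})
print(result)
```

Inside the loop from the question, this could be called as `select_rows(df_reps, dict(zip(query_fields, a)))`.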

Answer 2

Use pandas.DataFrame.drop_duplicates:

unique_df = df_reps.drop_duplicates(['direction','subj_id','speed'])

You now have a new DataFrame, with one row per unique combination of values in those columns, which you can continue to slice by other conditions.
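
One way the drop_duplicates result might be used to drive the original loop is sketched below. The contents of `df_reps` here are a made-up stand-in, and selecting rows through a NumPy comparison is an assumption of this sketch rather than something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_reps; the values are made up.
df_reps = pd.DataFrame({
    'direction': [1, 1, 2, 2, 2],
    'subj_id':   [1, 1, 1, 2, 2],
    'speed':     [1, 2, 1, 1, 1],
    'other1':    [10, 20, 30, 40, 50],
})
fields = ['direction', 'subj_id', 'speed']

# One row per combination that occurs; keep only the key columns.
unique_df = df_reps.drop_duplicates(fields)[fields]

subsets = {}
for combo in unique_df.itertuples(index=False, name=None):
    # Row-wise comparison of the key columns against this combination.
    mask = (df_reps[fields].to_numpy() == np.array(combo)).all(axis=1)
    subsets[combo] = df_reps[mask]

for combo, subset in subsets.items():
    print(combo, len(subset))
```

Like groupby, this only visits combinations that are present in the data.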

Python: condensing arbitrarily large DataFrame queries
