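Trying to filter out columns in a DataFrame by applying conditions on other columns

Time: 2020-09-24 17:18:14

Tags: python pandas pandas-groupby

I have a CSV file with 3 columns: account_id, game_variant, and no_of_games. The table looks like this: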


account_id    game_variant   no_of_games
130               a             2
145               c             1
130               b             4
130               c             1
142               a             3
140               c             2
145               b             5

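So I want to extract the number of games played in variants a, b, c, a∩b, b∩c, a∩c, and a∩b∩c.

By grouping on game_variant and summing no_of_games, I was able to extract the games played in a, b, and c individually, but I cannot work out the logic for the intersections. Please help.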

data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})

Thanks in advance.

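1 answer:

Answer 0: (score: 1)

Here is a solution that returns the intersections on a per-player basis. It additionally uses a defaultdict, which is very convenient in this case. I explain the code inline.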

from itertools import combinations
import pandas
from collections import defaultdict
from pprint import pprint  # only needed for pretty printing of dictionary

df = pandas.read_csv('df.csv', sep=r'\s+')  # assuming the data frame is in a file df.csv

# group by account_id to get subframes which only refer to one account.
data_agg2 = df.groupby(['account_id'])

# a defaultdict is a dictionary, where when no key is present, the function defined
# is used to create the element. This eliminates the check, if a key is
# already present or to set all combinations in advance.
games_played_2 = defaultdict(int)

# iterate over all accounts
for el in data_agg2.groups:
    # extract the sub-dataframe from the grouped object
    tmp = data_agg2.get_group(el)
    # print(tmp)  # you can uncomment this to see each account
    
    # This is in principle the same loop as suggested before. However, as not every
    # player has played all variants, one only has to create the number of combinations
    # necessary for that player
    for i in range(len(tmp.loc[:, 'no_of_games'])):
        # As now the game_variant is a column and not the index, the first part of zip
        # is slightly adapted. This loops over all combinations of variants for the
        # current account.
        for comb, combsum in zip(combinations(tmp.loc[:, 'game_variant'], i+1), combinations(tmp.loc[:, 'no_of_games'].values, i+1)):
            # Here, each variant combination gets a unique key. Comb is sorted, as the
            # variants might be not in alphabetic order. The number of games played for
            # each variant for that player are added to the value of all players before.
            games_played_2['_'.join(sorted(comb))] += sum(combsum)

pprint(games_played_2)

# returns
>> defaultdict(<class 'int'>,
            {'a': 5,
             'a_b': 6,
             'a_b_c': 7,
             'a_c': 3,
             'b': 9,
             'b_c': 11,
             'c': 4})
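
For readers who want to try the per-account idea without a df.csv file, here is a minimal sketch that builds the question's table inline and iterates over the groupby object directly. The variable names and the pairing of each variant with its count are illustrative choices, not part of the original answer, and the sketch assumes each account lists a given variant at most once.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

# The table from the question, built inline instead of read from a file.
df = pd.DataFrame({
    'account_id': [130, 145, 130, 130, 142, 140, 145],
    'game_variant': ['a', 'c', 'b', 'c', 'a', 'c', 'b'],
    'no_of_games': [2, 1, 4, 1, 3, 2, 5],
})

games_played = defaultdict(int)

for account_id, sub in df.groupby('account_id'):
    # Pair each variant with its game count so the two stay aligned,
    # and sort so that keys come out in alphabetical order.
    pairs = sorted(zip(sub['game_variant'], sub['no_of_games']))
    # Only build combinations of the variants this account actually played.
    for r in range(1, len(pairs) + 1):
        for combo in combinations(pairs, r):
            key = '_'.join(variant for variant, _ in combo)
            games_played[key] += int(sum(n for _, n in combo))

print(dict(games_played))
```

Pairing variant and count in one tuple avoids running two parallel combinations iterators, and the totals agree with the output shown above.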

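Since you have already extracted the number of games played for each variant, you can simply add them up. If you want to do this automatically, you can use itertools.combinations inside a loop that iterates over all possible combination lengths: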

from itertools import combinations
import pandas
import numpy as np
from pprint import pprint  # only needed for pretty printing of dictionary

df = pandas.read_csv('df.csv', sep=r'\s+')  # assuming the data frame is in a file df.csv

data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})

games_played = {}

for i in range(len(data_agg.loc[:, 'no_of_games'])):
    for comb, combsum in zip(combinations(data_agg.index, i+1), combinations(data_agg.loc[:, 'no_of_games'].values, i+1)):
        games_played['_'.join(comb)] = sum(combsum)

pprint(games_played)

This returns:

>> {'a': array([5], dtype=int64),
>>  'a_b': array([14], dtype=int64),
>>  'a_b_c': array([18], dtype=int64),
>>  'a_c': array([9], dtype=int64),
>>  'b': array([9], dtype=int64),
>>  'b_c': array([13], dtype=int64),
>>  'c': array([4], dtype=int64)}

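'combinations(sequence, number)' returns an iterator over all combinations of number elements taken from sequence. So to get all possible combinations, you have to iterate number from 1 to len(sequence). This is what the first for loop does.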
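
As a small illustration of that outer loop, the sketch below collects the combinations of every length for a three-element list. The list variants and the name all_combs are made up for this example.

```python
from itertools import combinations

variants = ['a', 'b', 'c']  # illustrative list matching the variants in the question

all_combs = []
# Combination lengths run from 1 up to and including len(variants).
for r in range(1, len(variants) + 1):
    all_combs.extend(combinations(variants, r))

print(all_combs)
# [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'b', 'c')]
```

For n elements this yields 2**n - 1 non-empty combinations, so it is only practical for a small number of variants.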

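The next for loop zips two iterators: one over the index of the aggregated data (combinations(data_agg.index, i+1)) and one over the number of games actually played in each variant (combinations(data_agg.loc[:, 'no_of_games'].values, i+1)). So comb is always a tuple of variants and combsum a tuple of the corresponding game counts. Note that to get all the values you have to use .loc[:, 'no_of_games'] rather than .loc['no_of_games'], because the latter looks for a row label named 'no_of_games', whereas it is a column name.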
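
The difference between the two .loc forms can be checked with a short sketch. It assumes a simplified aggregation (a plain sum that keeps a single column level) rather than the answer's agg call, and the names totals, column, and row_lookup_failed are illustrative.

```python
import pandas as pd

# The table from the question, built inline instead of read from a file.
df = pd.DataFrame({
    'account_id': [130, 145, 130, 130, 142, 140, 145],
    'game_variant': ['a', 'c', 'b', 'c', 'a', 'c', 'b'],
    'no_of_games': [2, 1, 4, 1, 3, 2, 5],
})

# One row per variant; the index holds 'a', 'b', 'c'.
totals = df.groupby('game_variant')[['no_of_games']].sum()

# Column selection: all rows, the column named 'no_of_games'.
column = totals.loc[:, 'no_of_games']
print(column.to_dict())

# Row selection: looks for an index label 'no_of_games', which does not exist.
try:
    totals.loc['no_of_games']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True
print(row_lookup_failed)
```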

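Finally, the dictionary key is set to the variants in the combination joined into one string, and its value to the sum of the games played across those variants.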
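
The values in the output above are one-element NumPy arrays because the aggregation was given a list of functions, which produces a two-level column. If plain integers are preferred, one possible variation is to aggregate into a Series and combine its (variant, total) pairs. This is a sketch under that assumption, with illustrative names, not the answer's original code.

```python
from itertools import combinations

import pandas as pd

# The table from the question, built inline instead of read from a file.
df = pd.DataFrame({
    'account_id': [130, 145, 130, 130, 142, 140, 145],
    'game_variant': ['a', 'c', 'b', 'c', 'a', 'c', 'b'],
    'no_of_games': [2, 1, 4, 1, 3, 2, 5],
})

# A Series indexed by variant; groupby sorts the index alphabetically.
totals = df.groupby('game_variant')['no_of_games'].sum()

games_played = {}
for r in range(1, len(totals) + 1):
    # Each combo is a tuple of (variant, total) pairs.
    for combo in combinations(totals.items(), r):
        key = '_'.join(variant for variant, _ in combo)
        games_played[key] = int(sum(n for _, n in combo))

print(games_played)
```

The totals are the same as in the array-valued output; only the value type changes.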