Question

I have the following dataset:

ID   Group Name   Information
1    A            'Info type1'
1    A            'Info type2' 
2    B            'Info type2' 
2    B            'Info type3' 
2    B            'Info type4'
3    A            'Info type2' 
3    A            'Info type5'
3    A            'Info type2'

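In the end, I want to count how many items a particular group has processed, grouped by a particular Info type.

As a first step, I defined a function that filters for a particular info type: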

def checkrejcted(strval):
    if strval == 'Info type5':
        return 'Rejected'
    else:
        return 'Not rejected'

Next, I applied this function to the Information column:

dataset['CheckRejected'] = dataset['Information'].apply(checkrejcted)

Finally, after dropping the Information column, I removed the duplicates. So the dataset looks like:

ID   Group Name   CheckRejected
1    A            'Not rejected'
2    B            'Not rejected' 
3    A            'Not rejected'
3    A            'Rejected'

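I would like to know whether there is a smarter way to count how often a particular group name occurs, grouped by Not rejected / Rejected. It can happen that a particular item has both Rejected and Not rejected information. That is fine, because I assume both would be counted in a count plot.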

Answer 1

You can use map together with fillna to supply a default for values that do not match:

maps = {"'Info type5'": "'Rejected'"}
or
maps = {"'Info type1'": "'Not Rejected'", "'Info type2'": "'Not Rejected'", "'Info type3'": "'Not Rejected'", "'Info type4'": "'Not Rejected'", "'Info type5'": "'Rejected'"}

df['Information'].map(maps).fillna("'Not Rejected'")

0    'Not Rejected'
1    'Not Rejected'
2    'Not Rejected'
3    'Not Rejected'
4    'Not Rejected'
5    'Not Rejected'
6        'Rejected'
7    'Not Rejected'

df['CheckRejected'] = df['Information'].map(maps).fillna("'Not Rejected'")

   ID Group Name   Information   CheckRejected
0   1          A  'Info type1'  'Not Rejected'
1   1          A  'Info type2'  'Not Rejected'
2   2          B  'Info type2'  'Not Rejected'
3   2          B  'Info type3'  'Not Rejected'
4   2          B  'Info type4'  'Not Rejected'
5   3          A  'Info type2'  'Not Rejected'
6   3          A  'Info type5'      'Rejected'
7   3          A  'Info type2'  'Not Rejected'

df.drop(columns='Information').drop_duplicates()

   ID Group Name   CheckRejected
0   1          A  'Not Rejected'
2   2          B  'Not Rejected'
5   3          A  'Not Rejected'
6   3          A      'Rejected'
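
As a self-contained sketch of the steps above, the snippet below rebuilds the sample data, applies map with fillna, drops duplicates, and then counts items per group and status. It assumes the Information values are stored without the literal quote characters shown in the printed frames, and the names deduped and item_counts are introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question (assumption: values stored without literal quotes).
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "Group Name": ["A", "A", "B", "B", "B", "A", "A", "A"],
    "Information": ["Info type1", "Info type2", "Info type2", "Info type3",
                    "Info type4", "Info type2", "Info type5", "Info type2"],
})

# Only the rejected type is listed; every other value maps to NaN
# and is then filled with the default label.
maps = {"Info type5": "Rejected"}
df["CheckRejected"] = df["Information"].map(maps).fillna("Not Rejected")

# Keep one row per (ID, Group Name, CheckRejected) combination.
deduped = df.drop(columns="Information").drop_duplicates()

# Count distinct items per group and status.
item_counts = deduped.groupby(["Group Name", "CheckRejected"]).size()
print(item_counts)
```

Because duplicates are dropped before counting, an item that has both statuses (ID 3 here) contributes once to each of them.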

Answer 2

You wrote that you want to count rows, so you probably need:

df.groupby(['Group Name', 'Information']).size()

For your sample data, the result is the following Series:

Group Name  Information
A           Info type1     1
            Info type2     3
            Info type5     1
B           Info type2     1
            Info type3     1
            Info type4     1
dtype: int64

Its MultiIndex contains the grouping keys (two levels), and the values are the numbers of occurrences.
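
A short sketch of how such a result might be read, assuming the sample data from the question with the Information values stored without literal quotes. The names counts, a_type2, and table are illustrative only.

```python
import pandas as pd

# Sample data from the question (assumption: values stored without literal quotes).
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "Group Name": ["A", "A", "B", "B", "B", "A", "A", "A"],
    "Information": ["Info type1", "Info type2", "Info type2", "Info type3",
                    "Info type4", "Info type2", "Info type5", "Info type2"],
})

counts = df.groupby(["Group Name", "Information"]).size()

# Look up one combination by its (Group Name, Information) key.
a_type2 = counts.loc[("A", "Info type2")]

# Move the inner index level into columns; combinations that
# never occur are filled with 0 instead of NaN.
table = counts.unstack(fill_value=0)
print(table)
```

The unstacked form has one row per group and one column per Info type, which can be easier to scan or plot than the two-level Series.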

Dropping duplicates does not do the job, because you lose the information about how many times a particular combination occurred.

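Or, if you want to count only rejected / not rejected cases, then:

map the Information column using your function and create a new column, say Status,
group by Group Name and Status.

The code to do it is: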

df['Status'] = df.Information.apply(checkrejcted)
df.groupby(['Group Name', 'Status']).size()
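
For completeness, a runnable sketch of the two steps on the sample data, assuming the Information values are stored without literal quotes. The pd.crosstab call at the end is an alternative added here, not part of the original answer; it gives the same counts as a table and shows 0 for combinations that never occur.

```python
import pandas as pd

def checkrejcted(strval):
    # Function from the question: label a single Information value.
    if strval == "Info type5":
        return "Rejected"
    else:
        return "Not rejected"

# Sample data from the question (assumption: values stored without literal quotes).
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "Group Name": ["A", "A", "B", "B", "B", "A", "A", "A"],
    "Information": ["Info type1", "Info type2", "Info type2", "Info type3",
                    "Info type4", "Info type2", "Info type5", "Info type2"],
})

# Step 1: derive the Status column from Information.
df["Status"] = df["Information"].apply(checkrejcted)

# Step 2: count rows per (Group Name, Status) combination.
status_counts = df.groupby(["Group Name", "Status"]).size()
print(status_counts)

# Alternative view: groups as rows, statuses as columns, missing combinations as 0.
table = pd.crosstab(df["Group Name"], df["Status"])
print(table)
```

Note that this counts rows, not distinct IDs, so an item with several Information rows is counted once per row.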
