Question

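I have a large dataset containing election race results. I am trying to get the number of precincts reporting versus the total number of precincts.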

Sample of the DataFrame, named precincts:

Race          officeID   CandidateId  total_votes  locationID    precinct_name
Mayor         10         705            20         101           111 
Mayor         10         805            30         101           888
Mayor         10         705            20         101           ##ABS 
Mayor         10         805            30         201           222
Mayor         10         705            20         201           888 
Mayor         10         805            30         201           ##ABS
Mayor         10         705            20         301           333 
Mayor         10         805            30         301           888
Mayor         10         805            30         301           ##ABS
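
For anyone who wants to reproduce the problem, the sample above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and storing precinct_name as strings is an assumption (the column mixes numeric-looking names with the '##ABS' label).

```python
import pandas as pd

# Rows copied from the sample table above.
# Assumption: precinct_name is stored as text, since it mixes
# numeric-looking names with the '##ABS' label.
precincts = pd.DataFrame({
    "Race": ["Mayor"] * 9,
    "officeID": [10] * 9,
    "CandidateId": [705, 805, 705, 805, 705, 805, 705, 805, 805],
    "total_votes": [20, 30, 20, 30, 20, 30, 20, 30, 30],
    "locationID": [101, 101, 101, 201, 201, 201, 301, 301, 301],
    "precinct_name": ["111", "888", "##ABS",
                      "222", "888", "##ABS",
                      "333", "888", "##ABS"],
})
```

Note that '888' and '##ABS' each appear under all three locationID values, which is the duplication described in this question.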

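I use the code below to get the vote totals, and then I apply a function to each group to get the number of precincts reporting and the total number of precincts.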

def get_precincts(prec):
    # 'reported' counts precincts with a non-null vote total;
    # 'total' counts every precinct row in the group.
    reported = {'reported': prec.total_votes.count(),
                'total': len(prec)}
    return reported

# Sum the votes per precinct (column name spelled 'officeID' as in the sample above)
precincts = precincts.groupby(['officeID', 'precinct_name'], as_index=False).total_votes.sum()

# Apply the get_precincts function to build the precinct reporting data
precincts_reporting = precincts.groupby(['officeID']).apply(get_precincts)

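This works, but it does not count the number of unique precincts when there are duplicate precinct IDs. To handle all the ##ABS duplicates, I simply replace them, like this: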

abs_mask = precincts['precinct_name'] == '##ABS'
precincts.loc[abs_mask, 'precinct_name'] = precincts.loc[abs_mask, 'locationID'].astype(str) + 'ABS'

But this does not solve the duplicate PrecinctID problem. There are many of them, and it is not feasible to find and replace every exception by hand.

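Edit: I found that the error occurs before I call .count(). I group the duplicate precincts together before counting them, so I need to replace the duplicate precinct_name values before that grouping step.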
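
One way to illustrate the point in the edit is to keep locationID in the grouping key, so that a repeated name in a different location is never merged in the first place. This is a minimal sketch, not a confirmed fix: it assumes that the pair (locationID, precinct_name) uniquely identifies a precinct, which the data shown here suggests but does not prove. The names per_precinct and by_name_only are made up for the example.

```python
import pandas as pd

# Reduced copy of the sample data ('888' and '##ABS' repeat across locations).
precincts = pd.DataFrame({
    "officeID": [10] * 9,
    "total_votes": [20, 30, 20, 30, 20, 30, 20, 30, 30],
    "locationID": [101, 101, 101, 201, 201, 201, 301, 301, 301],
    "precinct_name": ["111", "888", "##ABS",
                      "222", "888", "##ABS",
                      "333", "888", "##ABS"],
})

# Assumption: (locationID, precinct_name) identifies one precinct.
# min_count=1 keeps a precinct with no reported votes as NaN instead of 0.
per_precinct = (
    precincts
    .groupby(["officeID", "locationID", "precinct_name"])["total_votes"]
    .sum(min_count=1)
    .reset_index()
)

# 'reported' counts precincts with a non-null total; 'total' counts all of them.
precincts_reporting = per_precinct.groupby("officeID")["total_votes"].agg(
    reported="count",
    total="size",
)

# For comparison: grouping on the name alone merges the repeated names.
by_name_only = precincts.groupby("officeID")["precinct_name"].nunique()
```

On the sample rows this counts 9 precincts for office 10, whereas grouping on precinct_name alone sees only 5 distinct names.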

How to handle duplicate records when grouping in pandas

0 answers: