Return corresponding row values for duplicates in Python Pandas

Time: 2019-11-08 20:23:49

Tags: python excel python-3.x pandas

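I am trying to sort an Excel column to show duplicate zip codes. If there are duplicates, I want pandas to look up a column for the duplicated zip codes, sum the values, and create a new list with the duplicated zip codes and their summed values. So far I can build a list of all the duplicates, but I am lost on what to do next. Any help is appreciated, as I am new to coding.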

Sample code below:

import pandas as pd
from collections import Counter

df = pd.read_excel(r'L:\FixedIncomeReport.xlsx')

zip_code = df['Zip']
quantity = df['Quantity']
Pair = list(zip(zip_code, quantity))
dups=[]
zipcount= list(Counter(i[0] for i in Pair).items())


#print(zipcount)
for i in zipcount:
    if i[1] > 1 :
        dups.append(i[0])

def variable(element):
    if (element in dups):
        return True 
    else:
        return False

filtered = filter(variable, (i[0] for i in Pair))


for item in filtered:
    print(item)

    if item in (i[0] for i in Pair):
        print(list(i[1] for i in Pair))

1 Answer:

Answer 0 (score: 0)

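One way to get information about duplicates in a pandas DataFrame is the groupby function. You can group the DataFrame by zip code and count the occurrences while also summing the quantity field.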

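In the code below, I create a simple DataFrame with 10 zip codes and their respective quantities, where some of the zip codes are duplicated. The code then performs the grouping, filters the duplicated zip codes, and outputs the two lists I think you need.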

    import pandas as pd

    ## create sample dataframe
    df = pd.DataFrame({'Zip':['11111','00000','00001','11001','00000','11100','11111','00110','11011','00010'],
              'Quantity':[3,6,2,6,5,8,9,0,1,4]
              })

    ## group dataframe by Zip, count the number of occurrences and sum the Quantity field
    grouped_df = df.groupby('Zip')['Quantity'].agg(['sum','count']).reset_index()

    ## output the duplicated zipcodes as a dataframe with the number of occurrences and sum of quantity
    duplicated_df = grouped_df[grouped_df['count']>1]
    duplicated_df.columns = ['DuplicateZip','SumOfQuantity','NumOfOccurrences']

    ## output the duplicated zipcodes as a list
    duplicated_zipcodes_list = list(grouped_df[grouped_df['count']>1]['Zip'])

    ## output the sum of quantities for duplicated zipcodes as a list
    duplicated_zipcodes_quantitysum_list = list(grouped_df[grouped_df['count']>1]['sum'])
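
If the goal is to keep the original rows rather than one summary row per zip code, a possible variation is to flag duplicates with `Series.duplicated(keep=False)` and attach the per-zip total with `groupby(...).transform('sum')`. The sketch below reuses the sample data from the answer; the names `is_dup`, `dup_rows`, `totals` and the `ZipQuantityTotal` column are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'Zip': ['11111', '00000', '00001', '11001', '00000',
            '11100', '11111', '00110', '11011', '00010'],
    'Quantity': [3, 6, 2, 6, 5, 8, 9, 0, 1, 4],
})

# keep=False marks every occurrence of a repeated Zip, not only the later ones.
is_dup = df['Zip'].duplicated(keep=False)

# Keep only the rows whose Zip appears more than once.
dup_rows = df[is_dup].copy()

# transform('sum') returns one value per row, aligned with dup_rows' index.
# 'ZipQuantityTotal' is a hypothetical column name chosen for this sketch.
dup_rows['ZipQuantityTotal'] = dup_rows.groupby('Zip')['Quantity'].transform('sum')

# Collapse to one total per duplicated Zip, as a plain dict.
totals = dup_rows.groupby('Zip')['Quantity'].sum().to_dict()

print(dup_rows)
print(totals)
```

When the data comes from the Excel file in the question, zip codes with leading zeros may be read as integers; passing `dtype={'Zip': str}` to `pd.read_excel` is one way that may preserve them, assuming the cells are stored as text in the workbook.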