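I am trying to sort an Excel column to show duplicate zip codes. When there are duplicates, I want pandas to look up a column for the duplicated zip codes, sum the values, and create a new list from the duplicated zip codes and their summed values. Currently I can create a list of all the duplicates, but I am lost on what to do next. Thanks for your help, as I am new to coding.
Sample code below: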
import pandas as pd
from collections import Counter

df = pd.read_excel(r'L:\FixedIncomeReport.xlsx')
zip_code = df['Zip']
quantity = df['Quantity']
Pair = list(zip(zip_code, quantity))
dups = []
zipcount = list(Counter(i[0] for i in Pair).items())
#print(zipcount)
for i in zipcount:
    if i[1] > 1:
        dups.append(i[0])

def variable(element):
    if (element in dups):
        return True
    else:
        return False

filtered = filter(variable, (i[0] for i in Pair))
for item in filtered:
    print(item)
    if item in (i[0] for i in Pair):
        print(list(i[1] for i in Pair))
Answer 0 (score: 0)
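One way to get information about duplicates in a pandas DataFrame is the groupby function. You can group the DataFrame by zip code and count the number of occurrences while also summing the Quantity field.
In the code below, I create a simple DataFrame of 10 zip codes with their respective quantities, where some of the zip codes are duplicated. The code then performs the grouping, filters the duplicated zip codes, and outputs the two lists I think you need.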
import pandas as pd
## create sample dataframe
df = pd.DataFrame({
    'Zip': ['11111','00000','00001','11001','00000','11100','11111','00110','11011','00010'],
    'Quantity': [3,6,2,6,5,8,9,0,1,4]
})
## group dataframe by Zip, count the number of occurrences and sum the Quantity field
grouped_df = df.groupby('Zip')['Quantity'].agg(['sum','count']).reset_index()
## output the duplicated zipcodes as a dataframe with the number of occurrences and sum of quantity
duplicated_df = grouped_df[grouped_df['count']>1]
duplicated_df.columns = ['DuplicateZip','SumOfQuantity','NumOfOccurrences']
## output the duplicated zipcodes as a list
duplicated_zipcodes_list = list(grouped_df[grouped_df['count']>1]['Zip'])
## output the sum of quantities for duplicated zipcodes as a list
duplicated_zipcodes_quantitysum_list = list(grouped_df[grouped_df['count']>1]['sum'])
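
As a possible alternative to counting and then filtering, here is a minimal sketch that uses DataFrame.duplicated with keep=False to select every row whose Zip appears more than once, and then sums Quantity per Zip. It reuses the sample data from above; the names dup_rows and dup_sums are illustrative and not part of the original answer.

```python
import pandas as pd

# same sample data as in the answer above
df = pd.DataFrame({
    'Zip': ['11111','00000','00001','11001','00000','11100','11111','00110','11011','00010'],
    'Quantity': [3,6,2,6,5,8,9,0,1,4],
})

# keep=False marks every occurrence of a repeated Zip, not only the later ones
dup_rows = df[df.duplicated('Zip', keep=False)]

# sum Quantity for each duplicated Zip; groupby sorts the keys by default
dup_sums = dup_rows.groupby('Zip')['Quantity'].sum()

duplicated_zipcodes_list = dup_sums.index.tolist()
duplicated_zipcodes_quantitysum_list = dup_sums.tolist()

print(duplicated_zipcodes_list)              # ['00000', '11111']
print(duplicated_zipcodes_quantitysum_list)  # [11, 12]
```

When loading from Excel instead of building the frame by hand, zip codes may be parsed as integers and lose their leading zeros. If that matters for your data, passing dtype={'Zip': str} to pd.read_excel is one way to keep them as text.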