I'm trying to use Python to read a CSV file containing thousands of email addresses and then build a list of all the duplicates. This is what I have so far:
import csv
input_file='combined.csv'
original_list=[]
duplicate_list=[]
def readcsv(input_file):
    ifile = open(input_file, "rU")
    reader = csv.reader(ifile, delimiter=";")
    rownum = 0
    for row in reader:
        original_list.append (row)
        rownum += 1
    ifile.close()
    original_list.sort()
    return original_list
(readcsv(input_file))
seen_set = set()
duplicate_set = set(x for x in original_list if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set
print (duplicate_set)
print (unique_set)
Answer 0 (score: 0)
Instead of this (which, for the reasons explained in the comments, is bad Python even without the TypeError):
seen_set = set()
duplicate_set = set(x for x in original_list if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set
all you really need is
# first just use set to grab all the possible elements (make lists hashable by
# passing through tuple) -- this is a set comprehension
seen_set = {tuple(x) for x in original_list}
# the duplicates are just ones with counts > 1
duplicate_set = {t for t in seen_set if original_list.count(list(t)) > 1}
unique_set = seen_set - duplicate_set
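One caveat: `original_list.count(...)` rescans the whole list once per distinct row, which can get slow with thousands of addresses. A minimal sketch of a single-pass alternative using `collections.Counter` from the standard library follows; the sample rows are invented for illustration and stand in for whatever `csv.reader` returns.

```python
from collections import Counter

# Invented sample data standing in for the rows produced by csv.reader.
original_list = [
    ["a@example.com"],
    ["b@example.com"],
    ["a@example.com"],
    ["c@example.com"],
    ["b@example.com"],
]

# Rows are lists, which are unhashable, so convert each one to a tuple first.
counts = Counter(tuple(row) for row in original_list)

seen_set = set(counts)
# A row is a duplicate if it occurred more than once.
duplicate_set = {row for row, n in counts.items() if n > 1}
unique_set = seen_set - duplicate_set

print(sorted(duplicate_set))
print(sorted(unique_set))
```

The results are the same sets of tuples as above, but each row is counted only once.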
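Your function can also be written more simply as
def readcsv(input_file):
    # newline="" is what the csv module recommends when opening files
    with open(input_file, newline="") as ifile:
        reader = csv.reader(ifile, delimiter=";")
        return sorted(reader)  # don't mutate global variables!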
original_list = readcsv(input_file)
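To try this without a file on disk, here is a small sketch under a couple of assumptions: `read_rows` is a hypothetical variant of `readcsv` that accepts any open text stream instead of a filename, and the sample content is made up.

```python
import csv
import io


def read_rows(stream, delimiter=";"):
    """Return the CSV rows from an open text stream as a sorted list of lists."""
    return sorted(csv.reader(stream, delimiter=delimiter))


# Invented sample content; a real file would be opened with open(path, newline="").
sample = "b@example.com;B\na@example.com;A\nb@example.com;B\n"
original_list = read_rows(io.StringIO(sample))
print(original_list)
```

Taking a stream rather than a path keeps the parsing logic easy to test, since `io.StringIO` can stand in for the file.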