List comprehension mystery - Python

Time: 2013-10-29 22:04:57

Tags: python list-comprehension deduplication

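I have created two CSV lists. One is the original CSV file, the other is a de-duped version of that file. I have read each one into a list, and for all intents and purposes they are in the same format. Each list item is a string.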

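I am trying to use a list comprehension to find out which items the de-duplication removed. The original list has 16939 items and the de-duped list has 15368. That is a difference of 1571, but my list comprehension only returns 368 items. Ideas?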

deduped = open('account_de_ex.csv', 'r')
deduped_data = deduped.read()
deduped.close()
deduped = deduped_data.split("\r")

#read in file with just the account names from the full account list
account_names = open('account_names.csv', 'r')
account_data = account_names.read()
account_names.close()
account_names = account_data.split("\r")

# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]

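EDIT: In response to some of the comments, here is a test of my list comp and of the lists themselves. The set sizes are almost identical, a difference of 20 or so. Not the 1500 I need! Thanks!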

print len(deduped)
deduped = set(deduped)
print len(deduped)

print len(account_names)
account_names = set(account_names)
print len(account_names)


15368
15368
16939
15387

2 answers:

Answer 0: (score: 2)

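Try running this code and see what it reports. It requires Python 2.7 or newer for collections.Counter, but you could easily write your own counter code, or copy my example code from another answer: Python : List of dict, if exists increment a dict value, if not append a new dict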

from collections import Counter

# read in original records
with open("account_names.csv", "rt") as f:
    rows = sorted(line.strip() for line in f)

# count how many times each row appears
counts = Counter(rows)

# get a list of tuples of (count, row) that only includes count > 1
dups = [(count, row) for row, count in counts.items() if count > 1]
dup_count = sum(count-1 for count in counts.values() if count > 1)

# sort the list from largest number of dups to least
dups.sort(reverse=True)

# print a report showing how many dups
for count, row in dups:
    print("{}\t{}".format(count, row))

# get de-duped list
unique_rows = sorted(counts)

# read in de-duped list
with open("account_de_ex.csv", "rt") as f:
    de_duped = sorted(line.strip() for line in f)

print("List lengths: rows {}, uniques {}/de_duped {}, result {}".format(
        len(rows), len(unique_rows), len(de_duped), len(de_duped) + dup_count))

# lists should match since we sorted both lists
if unique_rows == de_duped:
    print("perfect match!")
else:
    # if lists don't match, find out what is going on
    uniques_set = set(unique_rows)
    deduped_set = set(de_duped)

    # find intersection of the two sets
    x = uniques_set.intersection(deduped_set)

    # print differences
    if x != uniques_set:
        print("Rows in original that are not in deduped:\n{}".format(sorted(uniques_set - x)))
    if x != deduped_set:
        print("Rows in deduped that are not in original:\n{}".format(sorted(deduped_set - x)))
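
A toy example shows why a membership test undercounts: `ele not in deduped` only catches rows that are missing from the de-duped file altogether, while extra copies of rows that are still present go unnoticed. A minimal sketch, assuming made-up sample data and a hypothetical helper named `removed_copies`, uses `Counter` subtraction as a multiset difference:

```python
from collections import Counter


def removed_copies(original, deduped):
    """Return a Counter of how many copies of each row the dedupe dropped.

    Hypothetical helper for illustration; not part of the question's code.
    """
    # Counter subtraction keeps only positive counts (a multiset difference).
    return Counter(original) - Counter(deduped)


# Made-up sample data: "a" appears 3 times, "b" and "d" twice, "c" once.
original = ["a", "b", "a", "c", "a", "b", "d", "d"]
# "d" is missing entirely from the de-duped list.
deduped = ["a", "b", "c"]

# The membership test only sees rows that are absent from deduped altogether.
by_membership = [row for row in original if row not in deduped]

# The multiset difference also sees the extra copies of "a" and "b".
removed = removed_copies(original, deduped)

print(by_membership)          # ['d', 'd']
print(sum(removed.values()))  # 5, which equals len(original) - len(deduped)
```

On the real files, the same subtraction should account for the full length difference, provided every row in the de-duped file also appears in the original.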

Answer 1: (score: 0)

To see what is really in each list, you can construct equivalent lists yourself:

If you only had unique elements:

deduped = list(range(15368))
account_names2 = list(range(15387))
dupes2 = [ele for ele in account_names2 if ele not in deduped] #len is 19

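But because your original list actually contains repeated copies both of elements that are missing from the de-duped list and of elements that are still in it: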

account_names =account_names2 + dupes2*18 + dupes2[:7] + account_names2[:1571  - 368]
dupes = [ele for ele in account_names if ele not in deduped] # dupes will have 368 elements
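
One way to check this construction against the numbers in the question is to rebuild it and count. The sketch below is an illustrative variant of the snippet above, not a new method: it wraps `range` in `list` so the concatenation also works on Python 3, and it uses a set (`deduped_set`, a name introduced here) for the membership test purely for speed.

```python
# Rebuild the constructed lists; list(range(...)) so "+" works on Python 3 too.
deduped = list(range(15368))
account_names2 = list(range(15387))

# A set makes the "not in" test fast; the result is the same as with a list.
deduped_set = set(deduped)

# Names that never made it into the de-duped list at all.
dupes2 = [ele for ele in account_names2 if ele not in deduped_set]

# Repeat those 19 names, then pad with repeats of names that ARE in deduped.
account_names = (account_names2 + dupes2 * 18 + dupes2[:7]
                 + account_names2[:1571 - 368])

dupes = [ele for ele in account_names if ele not in deduped_set]

print(len(dupes2))                        # 19
print(len(account_names))                 # 16939
print(len(set(account_names)))            # 15387
print(len(dupes))                         # 368
print(len(account_names) - len(deduped))  # 1571
```

The 368 hits are the 19 missing names counted with all their repeats (19 + 19 * 18 + 7), while the remaining 1203 removed rows are extra copies of names that still appear in the de-duped list, so the membership test never reports them.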