Question

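I have two files. The first contains a list of 6-character keys (SA0001, SA1001, etc.). The second contains a list of dates and amounts, where the first six positions of each record match a key in the first file. I want to verify that every key in the first file has at least one match in the second file. Multiple matches are fine, and it is also fine for the second file to contain records whose key is not in the first file. So it is basically a loop within a loop. The problem comes when I try to break out of the inner loop after the first match, which I want to do because the second file can be very large. It correctly prints the "found" message and breaks, but if it reaches the end of the second file without finding a match, it does not print the "not found" message. My code so far is: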

unvalues = open("file2.txt", "r")              # records: key, date, amount
newfunds = open("file1.txt", "r").readlines()  # one 6-character key per line
i = 1
for line in newfunds:
    line = line.strip()
    for line2 in iter(unvalues.readline, ""):
        try:
            if line == line2[:6]:
                print "%s: Matching %s to %s for date %s" % (i, line, line2[:6], line2[6:14])
                break
        except StopIteration: print "%s: No match for %s" % (i, line)
    i += 1
    unvalues.seek(0)

Answer 1

Use sets instead:

set1=set(line[:6] for line in open('file1.txt'))
set2=set(line[:6] for line in open('file2.txt'))
not_found = set1 - set2
if not_found:
    print "Some keys not found: " + ', '.join(not_found)
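The same set difference can be tried on in-memory data before pointing it at the real files. This is only a sketch in Python 3 syntax: the function name find_missing_keys, the key_len parameter, and the sample lines are made up for illustration, and it assumes the key always occupies the first six characters of a line.

```python
def find_missing_keys(key_lines, record_lines, key_len=6):
    """Return the keys from key_lines that never start a line in record_lines."""
    # Skip blank lines so an empty trailing line is not treated as a key.
    wanted = set(line[:key_len] for line in key_lines if line.strip())
    seen = set(line[:key_len] for line in record_lines)
    return wanted - seen


# In-memory stand-ins for file1.txt and file2.txt (made-up sample data).
key_lines = ["SA0001\n", "SA1001\n", "SA2001\n"]
record_lines = [
    "SA000120090101   100.00\n",
    "SA100120090102    55.10\n",
    "SA000120090103    12.00\n",
]

missing = find_missing_keys(key_lines, record_lines)
if missing:
    print("Some keys not found: " + ", ".join(sorted(missing)))
```

Because the function only iterates over its arguments, two open file objects can be passed in place of the lists.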

Answer 2

first_file=open("file1.txt","r")
#save all items from first file into a set
first_file_items=set(line.strip() for line in first_file)
second_file=open("file2.txt","r")
for line in second_file:
   if line[:6] in first_file_items:
       #if this is item from the first file, remove it from the set
       first_file_items.remove(line[:6])
       #when nothing is left in the set, we found everything
       if not first_file_items: break

if first_file_items:
   print "Elements in first file but not in second", first_file_items

Answer 3

I don't think break raises a StopIteration.

You generally don't want to use exceptions for flow control like that.
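One way to check this is to wrap a loop in try/except and see whether the handler ever runs. The snippet below is an illustrative sketch in Python 3 syntax; the function name and the sample keys are made up.

```python
def break_raises_stop_iteration(items, target):
    """Return True only if the loop let a StopIteration escape."""
    try:
        for item in items:
            if item == target:
                break  # leaves the loop silently; no exception is raised
    except StopIteration:
        return True
    return False


# Leaving the loop through break does not reach the handler.
print(break_raises_stop_iteration(["SA0001", "SA1001"], "SA0001"))
# Running off the end does not reach it either, because the for statement
# consumes the iterator's StopIteration internally.
print(break_raises_stop_iteration(["SA0001", "SA1001"], "SA9999"))
```

Both calls print False, which is why the except clause in the question never fires.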

Answer 4

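Go through each file once, adding each record's key to a hash with a value of 1. Then make sure the keys of the first hash are a subset of the keys of the second hash.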

hashes = []
for f in ["file1.txt", "file2.txt"]:
    lines = open(f, "r").readlines()
    hash = {}
    for line in lines:
        # the first six characters of each line are the key
        hash[line[:6]] = 1
    hashes.append(hash)

set_keys1 = set(hashes[0].keys())
set_keys2 = set(hashes[1].keys())
# fails if any key from file1.txt is missing from file2.txt
assert(set_keys1.issubset(set_keys2))
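The assert only says whether the check passed. If the missing keys themselves are wanted, the two dictionaries can be compared directly. The following is a rough sketch in Python 3 syntax with made-up sample lines; count_keys is a hypothetical helper, and it stores a count per key rather than a constant 1.

```python
def count_keys(lines, key_len=6):
    """Map each key (the first key_len characters) to the number of lines carrying it."""
    counts = {}
    for line in lines:
        key = line[:key_len]
        counts[key] = counts.get(key, 0) + 1
    return counts


# In-memory stand-ins for the lines of file1.txt and file2.txt.
first = count_keys(["SA0001\n", "SA1001\n", "SA2001\n"])
second = count_keys(["SA000120090101   100.00\n", "SA000120090103    12.00\n"])

# Dictionary key views support set operations in Python 3.
missing = sorted(first.keys() - second.keys())
print("Keys with no record: " + ", ".join(missing))
```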

Answer 5

I think this might be closer to what you want:

# file2.txt holds the records: a 6-character key followed by the date
unvalues = dict((line[:6], line[6:14]) for line in open("file2.txt", "r"))
# file1.txt holds one key per line
newfunds = [line for line in open("file1.txt", "r")]
for i, line in enumerate(newfunds):
    key = line.strip()
    if key in unvalues:
        v = unvalues[key]
        print "%s: Matching %s to %s for date %s" % (i+1, key, key, v)
    else:
        print "%s: No match for %s" % (i+1, key)

Answer 6

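You can't (and don't need to) catch the StopIteration exception raised when the iterator is exhausted, because the for loop catches it automatically. To do what you are trying to do, you can use an else block after the for block. For example, you could replace your inner loop with this: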

for line2 in iter(unvalues.readline, ""):
    if line == line2[:6]:
        print "%s: Matching %s to %s for date %s" % (i, line, line2[:6], line2[6:14])
        break
else:
    print "%s: No match for %s" % (i, line)

The else block is executed when the for loop finishes without a break statement being hit.
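The for/else behaviour can be tried without any files. Here is a small sketch in Python 3 syntax; report_matches and the sample data are made up for illustration, and the record layout (6-character key followed by an 8-character date) is taken from the question.

```python
def report_matches(keys, records):
    """Return one message per key, stopping the inner scan at the first match."""
    messages = []
    for i, key in enumerate(keys, 1):
        for record in records:
            if record[:6] == key:
                messages.append("%s: Matching %s for date %s" % (i, key, record[6:14]))
                break  # skip the remaining records for this key
        else:
            # Runs only when the inner loop ended without hitting break.
            messages.append("%s: No match for %s" % (i, key))
    return messages


keys = ["SA0001", "SA2001"]
records = ["SA000120090101   100.00", "SA100120090102    55.10"]
for message in report_matches(keys, records):
    print(message)
```

The first key stops at its first matching record, and the second key falls through to the else branch.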

However, you may find that one of the other approaches using sets is faster.

Answer 7

from collections import defaultdict

unvalues = open("file1.txt", "r").readlines()
newfunds = open("file2.txt", "r").readlines()

unvals = defaultdict(int)

for val in unvalues:
    # strip the newline so the key matches line[:6] below
    unvals[val.strip()] = 0

for line in newfunds:
    line = line.strip()

    if line[:6] in unvals:
        unvals[line[:6]] += 1

for k in unvals:
    if unvals[k] == 0:
        print "Match Not Found For %s" % k

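This might give you a good starting point for what you are trying to achieve without being too messy. It gives you the performance benefit of looping over each data set only once.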

As a quick addendum, if you want line numbers, instead of building a counter variable outside the loop and incrementing it, try this:

for i, line in enumerate(newfunds):

enumerate() basically pairs a sequential integer iterator with your list, producing the desired result without the unnecessary counting operations.
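As a concrete illustration, here is a short sketch in Python 3 syntax. The sample list stands in for the lines read from the key file, and the start argument makes the numbering begin at 1, as the counter in the question does.

```python
# Stand-in for the lines read from the key file (made-up sample data).
newfunds = ["SA0001\n", "SA1001\n", "SA2001\n"]

# enumerate yields (index, item) pairs; start=1 makes the numbering 1-based.
numbered = [(i, line.strip()) for i, line in enumerate(newfunds, start=1)]
for i, key in numbered:
    print("%s: %s" % (i, key))
```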

Answer 8

Another approach using sets:

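# keys that appear at the start of at least one record in file2.txt
keys = set(line[:6] for line in open('file2.txt'))
# keys listed in file1.txt that never appear in file2.txt
missing = set(value[:6] for value in open('file1.txt') if value[:6] not in keys)
if missing:
   print "Keys Missing " + ', '.join(missing)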

Searching a file for strings from a second file
