Question

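I am trying to open two files and check whether the first word of each line in file_1 appears in any line of file_2. If the first word of a line in file_1 matches the first word of a line in file_2, I'd like to print both lines. However, with the code below I am not getting any results. I will be dealing with very large files, so I'd like to avoid loading the files into memory with a list or dictionary. I can only use the built-in functions in Python 3.3. Any advice would be appreciated. If there is a better way to do this, please suggest it as well.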

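Steps I am trying to perform:

1.) Open file_1
2.) Open file_2
3.) Check whether the first word is in ANY line of file_2.
4.) If the first word in both files matches, print the lines from both file_1 and file_2.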

File contents:

{{1}}

Code attempt:

with open('file_1', 'r') as a, open('file_2', 'r') as b:
    for x, y in zip(a, b):
        if any(x.split()[0] in item for item in b):
            print(x, y)

Expected output:

('Pears: 10 items in stock', 'Pears: 25 items in stock')

Answer 1

Try:

for i in open('[Your File]'):
    for x in open('[Your File 2]'):
        # Note: this compares whole lines, not just the first word
        if i == x:
            print(i)

Answer 2

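I would actually strongly advise against storing data in 1GB text files rather than in some sort of database or standard data storage file format. If your data is more complex, I'd suggest at least CSV or some other delimited format. If you can split the data and store it in smaller chunks, a markup language such as XML, HTML, or JSON might work (it would make navigating and extracting the data easy); these formats are far more organized and are already optimized for what you are trying to do (locating matching keys and returning their values).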
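To illustrate the database suggestion, here is a minimal sketch using the standard-library sqlite3 module. The table names (file_1, file_2), the column names, and the helper names load_lines and matching_lines are all hypothetical, chosen only for this example, and the first whitespace-separated word of each line is assumed to be the key. Pointing sqlite3.connect at a file path rather than ':memory:' would keep the data on disk.

```python
import sqlite3

def load_lines(conn, table, path):
    """Insert one (key, line) row per non-blank line of the file at `path`.

    `table` is a hypothetical, trusted table name; it is formatted into
    the SQL text, so it must never come from untrusted input.
    """
    conn.execute(
        'CREATE TABLE IF NOT EXISTS {} (key TEXT, line TEXT)'.format(table))
    with open(path, 'r') as handle:
        # Generator: rows are streamed to SQLite, not collected in a list
        rows = ((line.split()[0], line.rstrip('\n'))
                for line in handle if line.strip())
        conn.executemany(
            'INSERT INTO {} VALUES (?, ?)'.format(table), rows)
    conn.commit()

def matching_lines(conn):
    """Return a cursor over (line_1, line_2) pairs that share a first word."""
    query = ('SELECT f1.line, f2.line FROM file_1 AS f1 '
             'JOIN file_2 AS f2 ON f1.key = f2.key')
    return conn.execute(query)
```

Typical use would be to call load_lines once per file and then loop over matching_lines(conn), printing each pair. Creating an index on the key column would likely speed up the join for large tables.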

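That said, you could use the "readline" method described in section 7.2.1 of the Python 3 docs to do what you are trying to do efficiently: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-file.

Alternatively, you can iterate over the files: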
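As a rough sketch of the readline idea, the generator below reads one line at a time from each file and rewinds the second file for every line of the first. The function name iter_first_word_matches is hypothetical, and keys are assumed to be the first whitespace-separated word of each line, compared exactly.

```python
def iter_first_word_matches(path_1, path_2):
    """Yield (line_1, line_2) pairs whose first words are equal.

    Hypothetical helper: only one line from each file is held in memory
    at a time, at the cost of re-reading file 2 for every line of file 1.
    """
    with open(path_1, 'r') as a, open(path_2, 'r') as b:
        line_a = a.readline()
        while line_a:                       # readline() returns '' at EOF
            words_a = line_a.split()
            if words_a:                     # skip blank lines
                b.seek(0)                   # rewind file 2 for a fresh scan
                line_b = b.readline()
                while line_b:
                    words_b = line_b.split()
                    if words_b and words_a[0] == words_b[0]:
                        yield line_a.rstrip('\n'), line_b.rstrip('\n')
                    line_b = b.readline()
            line_a = a.readline()
```

Looping over iter_first_word_matches('file_1', 'file_2') and printing each pair should give output in the tuple form shown in the question.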

def _get_key(string, delim):

    #Split key out of string
    key=string.split(delim)[0].strip()
    return key

def _clean_string(string, charToReplace):

    #Remove garbage from string
    for character in charToReplace:
        string=string.replace(character,'')

    #Strip leading and trailing whitespace
    string=string.strip()
    return string

def get_matching_key_values(file_1, file_2, delim, charToReplace):

    #Open the files to be compared
    with open(file_1, 'r') as a, open(file_2, 'r') as b:

        #Create an object to hold our matches
        matches=[]

        #Iterate over file 'a' and extract the keys, one-at-a-time
        for lineA in a:
            keyA=_get_key(lineA, delim)

            #Iterate over file 'b' and extract the keys, one-at-a-time
            for lineB in b:
                keyB=_get_key(lineB, delim)

                #Compare the keys. upper() may not be strictly needed, but I
                #usually prefer to compare all uppercase to all uppercase
                if keyA.upper()==keyB.upper():
                    cleanedOutput=(_clean_string(lineA, charToReplace),
                                   _clean_string(lineB, charToReplace))

                    #Append the match to the 'matches' list
                    matches.append(cleanedOutput)

            #Reset file 'b' pointer to start of file and try again
            b.seek(0)

    #Return our final list of matches
    #--NOTE: this method CAN return an empty 'matches' object!
    return matches

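This is not the best or most efficient way to do it:

All matches are saved to a list object in memory
No handling of duplicates
No speed optimization
The iteration over file 'b' occurs 'n' times, where 'n' is the number of lines in file 'a'. Ideally, you would iterate over each file only once.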
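One way the single-pass idea might look is a merge-style scan. This is only a sketch under strong assumptions that the question does not state: both files are already sorted by their first word (in Python's default string order), and each first word appears at most once per file. The names _next_keyed_line and iter_sorted_matches are hypothetical. Because it is a generator, matches are yielded one at a time rather than collected in a list.

```python
def _next_keyed_line(handle):
    """Return (first_word, line) for the next non-blank line, or (None, '') at EOF."""
    for line in handle:
        words = line.split()
        if words:
            return words[0], line.rstrip('\n')
    return None, ''

def iter_sorted_matches(path_1, path_2):
    """Yield matching (line_1, line_2) pairs, reading each file exactly once.

    ASSUMPTION: both files are sorted by first word and keys are unique
    within each file. With unsorted input, matches would be missed.
    """
    with open(path_1, 'r') as a, open(path_2, 'r') as b:
        key_a, line_a = _next_keyed_line(a)
        key_b, line_b = _next_keyed_line(b)
        while key_a is not None and key_b is not None:
            if key_a == key_b:
                yield line_a, line_b
                key_a, line_a = _next_keyed_line(a)
                key_b, line_b = _next_keyed_line(b)
            elif key_a < key_b:
                key_a, line_a = _next_keyed_line(a)   # file 1 is behind
            else:
                key_b, line_b = _next_keyed_line(b)   # file 2 is behind
```

If the files are not sorted, this sketch does not apply as written; they would need to be sorted first, or the nested scan above used instead.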

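Even using only base Python, I'm sure there is a better way to approach this.

For the Gist: https://gist.github.com/MetaJoker/a63f8596d1084b0868e1bdb5bdfb5f16

I think the Gist also has a link to the repl.it I used to write and test the code, if you want to play with a copy in your browser.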

Python - Opening files for comparison