Working with files in Python

Time: 2015-07-15 17:13:53

Tags: python search writing

I have two files: one has a single word per line, the other has three columns per line. They look like this:

List file:

Gene1
Gene2
Gene3
Gene4

Master file:

Gene8   Gene3   2.1
Gene10  Gene5   3
Gene1   Gene20  2.1
Gene3   Gene2   3.3 
Gene48  Gene95  2

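What I want is to use the List file to search the Master file, extract the lines that match an entry in the List, and write them to a third, new file. So the desired output is:

New file: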

Gene8   Gene3   2.1
Gene1   Gene20  2.1
Gene3   Gene2   3.3

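I tried using regular expressions with re.search, but I can't seem to get it right: whenever there is a match, it writes the whole document instead of just the matching lines.

I also tried loading the files, converting them to strings, and using a double for loop, but it seems to match word by word, which makes the output file hard to manage.

Yes, I have seen the post "Use Python to search lines of file for list entries", but I could not get it to work. The resulting file needs more formatting, which complicates the process, and I seem to lose some information (the list file has thousands of entries and the master file has hundreds of thousands of lines, so it is not easy to keep track).

I'm asking here because I know there should be a simpler, more efficient way to do this, since it will need to be run many times.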

2 answers:

Answer 0: (score: 2)

Load the list of keywords into a set:

# collect the non-empty lines of the list file into a set for fast lookups
keywords = set()
with open(list_file_path) as list_file:
    for line in list_file:
        word = line.strip()
        if word:
            keywords.add(word)

Then iterate over each line of the master file and pull out the lines that contain at least one keyword:

with open(master_file_path) as master_file:
    with open(search_results_path, 'w') as search_results:
        for line in master_file:
            # compare only the gene columns; [:-1] drops the trailing numeric column
            if set(line.split()[:-1]) & keywords:
                search_results.write(line)
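
Putting the two steps together, a minimal self-contained sketch might look like the following. The function names `load_keywords` and `filter_master` and the file names in the usage note are assumptions made for illustration; they are not part of the original answer. Like the snippet above, it ignores the last (numeric) column when matching.

```python
def load_keywords(list_path):
    """Return the non-empty, stripped lines of list_path as a set."""
    with open(list_path) as list_file:
        return {line.strip() for line in list_file if line.strip()}


def filter_master(list_path, master_path, out_path):
    """Copy the lines of master_path whose gene columns contain a keyword.

    Returns the number of lines written to out_path.
    """
    keywords = load_keywords(list_path)
    written = 0
    with open(master_path) as master_file, open(out_path, "w") as out_file:
        for line in master_file:
            # Ignore the last column (the numeric value) when matching.
            if keywords.intersection(line.split()[:-1]):
                out_file.write(line)
                written += 1
    return written
```

With the sample data from the question saved under the hypothetical names list.txt and master.txt, calling filter_master("list.txt", "master.txt", "new_file.txt") should write the three expected lines. Because each line costs only a couple of set lookups, this should scale reasonably to a master file with hundreds of thousands of lines.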

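Answer 1: (score: 0)

This should do it. I used the two sample data files you provided, and the code below produces the desired output you posted. If this process is repeated often and you need to speed it up, you may want to consider a different search algorithm. If that is the case, let me know which operations are most common (inserting into the list, searching the list, deleting items from the list) and we can pick the most suitable search algorithm.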

# open the list of words to search for
list_file = open('list.txt')

search_words = []

# loop through the words in the search list
for word in list_file:

    # save each word in an array and strip whitespace
    search_words.append(word.strip())

list_file.close()

# this is where the matching lines will be stored
matches = []

# open the master file
master_file = open('master.txt')

# loop through each line in the master file
for line in master_file:

    # split the current line into array, this allows for us to use the "in" operator to search for exact strings
    current_line = line.split()

    # loop through each search word
    for search_word in search_words:

        # check if the search word is in the current line
        if search_word in current_line:

            # if found then save the line as we found it in the file
            matches.append(line)

            # once found then stop searching the current line
            break

master_file.close()


# create the new file
new_file = open('new_file.txt', 'w+')

# loop through all of the matched lines
for line in matches:

    # write the current matched line to the new file
    new_file.write(line)

new_file.close()
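
The reason for splitting each line before using "in" can be seen in a small sketch. The sample line is taken from the master file in the question; the variable names are only illustrative. A substring test on the raw string would wrongly treat Gene1 as present in a line that only contains Gene10, which may also be part of why the regex attempt in the question matched more than expected.

```python
line = "Gene10  Gene5   3\n"

# Substring test on the raw string: "Gene1" is found inside "Gene10".
substring_match = "Gene1" in line

# Membership test on the split tokens: only whole fields match.
token_match = "Gene1" in line.split()

print(substring_match, token_match)  # True False
```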