I have two files: one with a single word per line, and another with three fields per line. They look like this:
List file:
Gene1
Gene2
Gene3
Gene4
Master file:
Gene8 Gene3 2.1
Gene10 Gene5 3
Gene1 Gene20 2.1
Gene3 Gene2 3.3
Gene48 Gene95 2
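What I want is to use the List file to search the Master file, extract the lines that match an entry in List, and write them to a third, new file. The expected output is:
New file: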
Gene8 Gene3 2.1
Gene1 Gene20 2.1
Gene3 Gene2 3.3
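I tried using regular expressions with re.search, but I don't seem to have gotten it right: whenever there is a match, it writes the whole document instead of only the matching lines.
I also tried loading the files, converting them to strings, and using a double for loop, but it appears to match word by word, which makes the output file unmanageable.
Yes, I have seen the post "Use Python to search lines of file for list entries", but I couldn't get it to work. The resulting file needs more formatting, which complicates the process, and I seem to lose some information (the list file has thousands of entries and the master file has hundreds of thousands of lines, so it is not easy to keep track).
I am coming to you because I know there should be a simpler, more efficient way to do this, since it needs to be run many times.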
Answer 0 (score: 2)
Load the list of keywords into a set:
keywords = set()
with open(list_file_path) as list_file:
    for line in list_file:
        if line.strip():
            keywords.add(line.strip())
Then iterate over each line of the master file and pull out the lines that contain at least one keyword:
with open(master_file_path) as master_file:
    with open(search_results_path, 'w') as search_results:
        for line in master_file:
            if set(line.split()[:-1]) & keywords:
                search_results.write(line)
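The two steps above can be combined into one function. This is only a minimal sketch: the function name `extract_matching_lines` and the file names in the usage part are made up for illustration, and it assumes, like the code above, that the last whitespace-separated field on each master line is a score rather than a gene name.

```python
import os
import tempfile


def extract_matching_lines(list_file_path, master_file_path, search_results_path):
    """Copy master lines whose gene fields contain a keyword; return the count."""
    with open(list_file_path) as list_file:
        # One keyword per line; blank lines are skipped.
        keywords = {line.strip() for line in list_file if line.strip()}

    written = 0
    with open(master_file_path) as master_file, \
            open(search_results_path, "w") as search_results:
        for line in master_file:
            # Drop the last field (assumed to be the score) before comparing.
            if keywords.intersection(line.split()[:-1]):
                search_results.write(line)
                written += 1
    return written


# Usage with the sample data from the question, written to a temporary directory.
workdir = tempfile.mkdtemp()
list_path = os.path.join(workdir, "list.txt")
master_path = os.path.join(workdir, "master.txt")
result_path = os.path.join(workdir, "new_file.txt")

with open(list_path, "w") as f:
    f.write("Gene1\nGene2\nGene3\nGene4\n")
with open(master_path, "w") as f:
    f.write("Gene8 Gene3 2.1\nGene10 Gene5 3\nGene1 Gene20 2.1\n"
            "Gene3 Gene2 3.3\nGene48 Gene95 2\n")

count = extract_matching_lines(list_path, master_path, result_path)
with open(result_path) as f:
    result_text = f.read()
print(result_text, end="")
```

Because whole fields are compared against the set, `Gene1` in the list does not match `Gene10` in the master file, which is the kind of partial match a plain substring or unanchored regex search would produce.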
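Answer 1 (score: 0)
This should do it. I used the two sample data files you provided, and the code below produces the desired output you posted. If this process is repeated often and you need to speed it up, you may want to consider a different search algorithm. If that is the case, let me know which operations are most common (inserting into the list, searching the list, deleting items from the list) and we can pick the most appropriate search algorithm.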
# open the list of words to search for
list_file = open('list.txt')
search_words = []
# loop through the words in the search list
for word in list_file:
    # save each word in an array and strip whitespace
    search_words.append(word.strip())
list_file.close()
# this is where the matching lines will be stored
matches = []
# open the master file
master_file = open('master.txt')
# loop through each line in the master file
for line in master_file:
    # split the current line into array, this allows for us to use the "in" operator to search for exact strings
    current_line = line.split()
    # loop through each search word
    for search_word in search_words:
        # check if the search word is in the current line
        if search_word in current_line:
            # if found then save the line as we found it in the file
            matches.append(line)
            # once found then stop searching the current line
            break
master_file.close()
# create the new file
new_file = open('new_file.txt', 'w+')
# loop through all of the matched lines
for line in matches:
    # write the current matched line to the new file
    new_file.write(line)
new_file.close()
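If the lookup has to run many times, one possible speed-up is to keep the search words in a set instead of a list, so that each field of a master line is checked with a hash lookup rather than a scan over thousands of words. This is an assumption about where the time goes, not something I measured. A minimal sketch with the sample data held in memory; `filter_lines` is a hypothetical helper name:

```python
def filter_lines(search_words, master_lines):
    """Return the master lines that contain a search word as a whole field."""
    wanted = set(search_words)
    matches = []
    for line in master_lines:
        # any() stops at the first matching field, like the break in the loop above.
        if any(field in wanted for field in line.split()):
            matches.append(line)
    return matches


# Sample data from the question.
search_words = ["Gene1", "Gene2", "Gene3", "Gene4"]
master_lines = [
    "Gene8 Gene3 2.1\n",
    "Gene10 Gene5 3\n",
    "Gene1 Gene20 2.1\n",
    "Gene3 Gene2 3.3\n",
    "Gene48 Gene95 2\n",
]

matches = filter_lines(search_words, master_lines)
print("".join(matches), end="")
```

Since `master_lines` can be any iterable of strings, an open file object could be passed in directly. Like the code above, this checks every field on the line, including the numeric last column, so a list entry such as `3` would also match a score of `3`.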