Question

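I'm downloading mtDNA records from NCBI and trying to extract lines from them using Python. The lines I'm trying to extract either start with or contain certain keywords, such as 'haplotype' and 'nationality' or 'locality'. I've tried the following code: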

import re
infile = open('sequence.txt', 'r') #open in file 'infileName' to read
outfile = open('results.txt', 'a') #open out file 'outfileName' to write

for line in infile:
    if re.findall("(.*)haplogroup(.*)", line):
        outfile.write(line)
        outfile.write(infile.readline())

infile.close()
outfile.close()

The output contains only the first line that contains 'haplogroup'. For example, it does not include the following line from infile:

                 /haplogroup="T2b20"

I have also tried the following:

keep_phrases = ["ACCESSION", "haplogroup"]

for line in infile:
    for phrase in keep_phrases:
        if phrase in line:
            outfile.write(line)
            outfile.write(infile.readline())

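But this does not give me all the lines containing ACCESSION and haplogroup.

line.startswith works, but I can't use it for lines where the word is in the middle of the line.

Can anyone give me example code that writes the following line to my output because it contains 'locality':

/note="origin_locality:Wales"

Any other suggestions on how to extract lines containing certain words are also appreciated.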

EDIT:

                 /haplogroup="L2a1l2a"
                 /note="ethnicity:Ashkenazic Jewish;
                 origin_locality:Poland: Warsaw; origin_coordinates:52.21 N
                 21.05 E"
                 /note="TAA stop codon is completed by the addition of 3' A
                 residues to the mRNA"
                 /note="codons recognized: UCN"

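In this case, using Peter's code, the first three lines are written to outfile, but not the line containing 21.05 E". How can I make an exception for /note=" and copy all lines up to the second quotation mark, without copying the lines that contain /note="TAA or /note="codons?

EDIT2:

This is the solution that is currently working for me.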

stuff_to_write = []
multiline = False
with open('sequences.txt') as f:
    for line in f.readlines():
        if any(phrase in line for phrase in keep_phrases) or multiline:
            do_not_write = False
            if multiline and line.count('"') >= 1:
                multiline = False
            if 'note' in line:
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
                elif line.count('"') < 2:
                    multiline = True
            if not do_not_write:
                stuff_to_write.append(line)

Answer 1

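This searches the file for matching phrases and writes those lines to a new file, provided that whatever follows "note" does not match anything in remove_phrases.

It reads the input line by line, checks each line against the words in keep_phrases, stores all matching lines in a list, and then writes them to a new file on separate lines. Unless you need to write the new file line by line as matches are found, this should be much faster, since everything is written at once.

If you don't want the match to be case-sensitive, change any(phrase in line to any(phrase.lower() in line.lower().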
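As a small illustration of the case-insensitive variant, the check can be pulled into a helper. This is only a sketch: the function name line_matches and its ignore_case flag are hypothetical names made up for this example, and the sample line is adapted from the question with a capital H.

```python
def line_matches(line, phrases, ignore_case=False):
    """Return True if any phrase occurs anywhere in the line."""
    if ignore_case:
        # Lower-case both sides so 'Haplogroup' and 'haplogroup' compare equal
        line = line.lower()
        phrases = [phrase.lower() for phrase in phrases]
    return any(phrase in line for phrase in phrases)


keep_phrases = ["ACCESSION", "haplogroup", "locality"]
sample = '                 /Haplogroup="T2b20"\n'

print(line_matches(sample, keep_phrases))                    # False
print(line_matches(sample, keep_phrases, ignore_case=True))  # True
```

Because the test is a plain substring check, it finds the keyword whether it is at the start of the line or in the middle, which is what line.startswith cannot do.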

keep_phrases = ["ACCESSION", "haplogroup", "locality"]
remove_phrases = ['codon', 'TAA']

stuff_to_write = []
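with open('C:/a.txt') as f:
    for line in f:  # iterate over the file directly; readlines() is not needed
        if any(phrase in line for phrase in keep_phrases):
            do_not_write = False
            if 'note' in line:
                # Skip notes whose text after "note" contains an unwanted phrase
                if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                    do_not_write = True
            if not do_not_write:
                stuff_to_write.append(line)

with open('C:/b.txt', 'w') as f:
    # Each stored line already ends with '\n', so write the lines as they are;
    # joining them with '\r\n' would put blank lines between them.
    f.writelines(stuff_to_write)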
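For the follow-up in the question, where a /note value runs over several lines, one possible approach is to remember whether a quoted value is still open and whether it is being kept. The following is a rough sketch under stated assumptions, not a definitive implementation: the function name filter_lines is hypothetical, "note" is added to keep_phrases so that every note is kept unless it matches remove_phrases, and a value is assumed to contain no double quotes other than the opening and closing ones. The sample text is the one from the question's edit.

```python
def filter_lines(lines, keep_phrases, remove_phrases):
    """Yield wanted lines, including continuation lines of kept values."""
    in_value = False  # True while a quoted value is still open
    keeping = False   # whether lines of the current value are yielded
    for line in lines:
        if in_value:
            # Continuation line: the value ends at the closing quote
            if '"' in line:
                in_value = False
        else:
            keeping = any(phrase in line for phrase in keep_phrases)
            if keeping and 'note' in line:
                rest = line.split('note', 1)[1]
                keeping = not any(phrase in rest for phrase in remove_phrases)
            # Exactly one quote means the value continues on the next line
            in_value = line.count('"') == 1
        if keeping:
            yield line


# Assumption: "note" is listed here so whole notes are kept or dropped as a unit
keep_phrases = ["ACCESSION", "haplogroup", "note"]
remove_phrases = ['codon', 'TAA']

sample = '''\
                 /haplogroup="L2a1l2a"
                 /note="ethnicity:Ashkenazic Jewish;
                 origin_locality:Poland: Warsaw; origin_coordinates:52.21 N
                 21.05 E"
                 /note="TAA stop codon is completed by the addition of 3' A
                 residues to the mRNA"
                 /note="codons recognized: UCN"
'''

for kept in filter_lines(sample.splitlines(keepends=True),
                         keep_phrases, remove_phrases):
    print(kept, end='')
```

On this sample the sketch yields the haplogroup line and the three lines of the ethnicity note, and drops both lines of the TAA note as well as the codons note. Because filter_lines is a generator, it can also be fed an open file object directly, and its output can be passed to writelines.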
