How to filter a file using values from another file in Python

Asked: 2014-07-09 15:47:43

Tags: python file filter bioinformatics

I have a file named sequence.txt, and I have already split it into lists. It looks like this:

Original file:

>102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

>103L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

>104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

After I split them into lists:

['>102L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']

['>103L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']

['>104L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']

I have another file named title.txt that contains the names/titles of all the sequences I want. It looks like this:

>102L
>104L

Based on this title.txt file, I want to filter out every sequence whose title is not in the title list and store the remaining ones in another file named filter_sequence.txt. The resulting file should look like this:

>102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

>104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

Notice that 103L is gone. I am using Python and I don't know how to solve this. Can anyone help me? Thanks!

Here is my final code:

import string

fin = open('title.txt')
all_titles = fin.readlines()
fin.close()
all_titles = map(string.strip, all_titles)

f = open('filtered_sequence.txt', 'w')
sequence_list = open('sequence.txt')
for sequence in sequence_list:
    lists = sequence.strip() # Strip the sequence file into lists of sequence
    if lists[0] in all_titles:
        write_string = lists[0] + lists[1] + "\n\n"
        f.write(write_string)

f.close()

title.txt is:

>102L
>104L

sequence.txt is:

102L     Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

103L     Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

104L     Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

I want my filtered_sequence.txt to look like:

102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

But the filtered_sequence.txt file is empty. Can you help me?

1 Answer:

Answer 0 (score: 0)

I would store the second file in a list, I think.

with open("title.txt", "r") as f:
    # Read every title and strip the trailing newline from each.
    # A list comprehension works in both Python 2 and Python 3.
    all_titles = [line.strip() for line in f]

all_titles then contains ['>102L', '>104L']. From there, just do an "item in list" test:

# sequence_list is the list of [title, sequence] pairs from the question.
with open("filter_sequence.txt", "w") as f:   # The file to write to.
    for sequence in sequence_list:
        if sequence[0] in all_titles:         # sequence[0] is the sequence title.
            # sequence[1] already starts with "Sequence:", so it is not repeated here.
            f.write(sequence[0] + " " + sequence[1] + "\n\n")

That should do it. item in list is a quick Boolean test that checks whether any element of list equals item.
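If you would rather read both files straight from disk, here is a minimal sketch of the same idea. It assumes each non-blank line of sequence.txt holds a title and a sequence separated by whitespace, as in the question. The function name filter_sequences and the lstrip(">") normalization are my own additions, there so that "102L" and ">102L" compare equal. Note that it uses split() rather than strip(): strip() returns a string, so indexing it with [0] gives a single character, which never matches a title.

```python
def filter_sequences(title_path, sequence_path, out_path):
    """Copy only the sequences whose title is listed in title_path."""
    with open(title_path) as f:
        # A set keeps each membership test fast. lstrip(">") is an assumption
        # so that ">102L" and "102L" are treated as the same title.
        wanted = {line.strip().lstrip(">") for line in f if line.strip()}

    kept = 0
    with open(sequence_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()      # e.g. ['102L', 'Sequence:MNIF...']
            if len(fields) < 2:
                continue               # skip blank or malformed lines
            title = fields[0].lstrip(">")
            if title in wanted:
                dst.write(title + " " + fields[1] + "\n")
                kept += 1
    return kept                        # number of sequences written
```

You would call it as filter_sequences("title.txt", "sequence.txt", "filtered_sequence.txt"). It writes one sequence per line. If the return value is 0, the titles in the two files probably do not match character for character.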

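Note: if you want to write 102L instead of >102L, you can drop the first character of sequence[0] by writing sequence[0][1:]. This takes the substring that starts at character 1 (the second character) and runs to the end.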
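As a quick illustration of that slice, here is a small sketch. The variable title is only an example value. It also shows str.lstrip as an alternative; that is my own suggestion, and it may be safer if some titles lack the leading > character.

```python
title = ">102L"

# Slicing from index 1 drops the first character, whatever it is.
print(title[1:])            # 102L

# lstrip(">") removes only leading ">" characters, so a title
# without one is left unchanged.
print(title.lstrip(">"))    # 102L
print("102L".lstrip(">"))   # 102L

# The slice, by contrast, always removes one character.
print("102L"[1:])           # 02L
```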