Question

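In my code, all I am trying to do is clean up a FastA file by keeping only the letters A, C, T, G, N, and U in the output string. I am trying to do this with a regular expression, which looks like this: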

newFastA = (re.findall(r'A,T,G,C,U,N',self.fastAsequence)) #trying to extract all of the listed bases from my fastAsequence.
print (newFastA)

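However, I am not getting all occurrences of the bases in order. I think my regular expression is malformed, so it would be great if you could let me know what I did wrong.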

Answer 1

I would avoid regular expressions entirely. You can use str.translate to remove the characters you don't want.

from string import ascii_letters

removechars = ''.join(set(ascii_letters) - set('ACTGNU'))

newFastA = self.fastAsequence.translate(None, removechars)

Demo:

dna = 'ACTAGAGAUACCACG this will be removed GNUGNUGNU'

dna.translate(None, removechars)
Out[6]: 'ACTAGAGAUACCACG     GNUGNUGNU'

If you also want to remove whitespace, you can add string.whitespace to removechars.

Side note: the above only works in Python 2. In Python 3 there is one more step:

from string import ascii_letters, punctuation, whitespace

#showing how to remove whitespace and punctuation too in this example
removechars = ''.join(set(ascii_letters + punctuation + whitespace) - set('ACTGNU'))

trans = str.maketrans('', '', removechars)

dna.translate(trans)
Out[11]: 'ACTAGAGAUACCACGGNUGNUGNU'
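
One thing to keep in mind with this approach is that it only deletes the characters you list, so digits (and any non-ASCII characters) would pass through untouched. Here is a minimal Python 3 sketch that also strips digits and wraps the table in a reusable function. The names `KEEP`, `REMOVE`, `TABLE`, and `clean_sequence` are made up for illustration and are not part of the original answer.

```python
from string import ascii_letters, digits, punctuation, whitespace

KEEP = "ACTGNU"

# Every ASCII letter, digit, punctuation mark, and whitespace character
# that is not one of the wanted bases goes into the deletion set.
REMOVE = "".join(set(ascii_letters + digits + punctuation + whitespace) - set(KEEP))

# Build the translation table once; the third argument lists characters to delete.
TABLE = str.maketrans("", "", REMOVE)


def clean_sequence(seq):
    """Return seq with the listed ASCII characters removed (Python 3 only)."""
    return seq.translate(TABLE)


print(clean_sequence("ACTG 123 nnn GNU"))  # ACTGGNU
```

Building the table once at module level avoids recomputing it for every sequence, which may matter if the file contains many records.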

Answer 2

print re.sub("[^ACTGNU]","",fastA_string)

to go along with the other million answers,

or without re:

print "".join(filter(lambda character: character in set("ACTGUN"), fastA_string))
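
The two lines above use the Python 2 print statement. For reference, here is a rough Python 3 sketch of the same two ideas: a negated character class passed to re.sub, and a plain membership test with no regex at all. The function names and the sample string are assumptions for the sake of the example.

```python
import re

# Precompiled pattern: any single character that is NOT one of the wanted bases.
NON_BASE = re.compile(r"[^ACTGNU]")
ALLOWED = set("ACTGNU")


def clean_with_regex(seq):
    """Delete every character that is not A, C, T, G, N, or U."""
    return NON_BASE.sub("", seq)


def clean_without_re(seq):
    """Same result, keeping only characters found in the allowed set."""
    return "".join(ch for ch in seq if ch in ALLOWED)


fastA_string = "ACTAGAGAUACCACG this will be removed GNUGNUGNU"
print(clean_with_regex(fastA_string))  # ACTAGAGAUACCACGGNUGNUGNU
print(clean_without_re(fastA_string))  # ACTAGAGAUACCACGGNUGNUGNU
```

Both versions are whitelists, so anything outside the six letters is dropped, including digits, punctuation, and lowercase letters.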

Answer 3

You need to use a character set.

re.findall(r"[ATGCUN]", self.fastAsequence)

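Your code looks for the LITERAL "A,T,G,C,U,N" and returns every occurrence of that exact string. A character set in a regular expression lets you search for "any one of the following: A, T, G, C, U, N" rather than "exactly the following: A,T,G,C,U,N".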
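
To make the difference concrete, here is a small sketch comparing the two patterns on a made-up input (the `sequence` value and variable names are only for illustration). Note that re.findall returns a list of matches, so you would join it to get a string back.

```python
import re

sequence = "ACxTG-NU"

# The original pattern only matches the exact text "A,T,G,C,U,N", commas included.
literal_hits = re.findall(r"A,T,G,C,U,N", sequence)

# The character set matches any single one of the listed letters.
class_hits = re.findall(r"[ATGCUN]", sequence)

# findall gives a list of one-character strings; join them to rebuild the sequence.
cleaned = "".join(class_hits)

print(literal_hits)  # []
print(class_hits)    # ['A', 'C', 'T', 'G', 'N', 'U']
print(cleaned)       # ACTGNU
```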

Regular expression to find certain bases in a sequence
