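I'm having some trouble using regular expressions to extract information from flat files (plain text). The files are structured as follows:
ID (e.g. >YAL001C)
Annotation/metadata (a phrase describing where the ID comes from)
Sequence (a very long string, e.g. KRHDE..., about 500 letters on average)
I'm trying to extract only the ID and the sequence, skipping all the metadata. Unfortunately, list operations alone are not enough, for example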
with open("composition.in","rb") as all_info:
    all_info=all_info.read()
all_info=all_info.split(">")[1:]
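because the metadata/annotation part of the text is littered with the '>' character, so the resulting list is structured incorrectly. The list comprehensions got very ugly after a certain point, so I'm trying the following: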
with open("composition.in","rb") as yeast_all:
    yeast_all=yeast_all.read() # convert file to string
## Regular expression to clean up rogue ">" characters
## i.e. "<i>", "<sub>", etc which screw up
## the structure of the eventual list
import re
id_delimeter = r'^>{1}+\w{7,10}+\s'
match=re.search(id_delimeter, yeast_all)
if match:
    print 'found', match.group()
else:
    print 'did not find'
yeast_all=yeast_all.split(id_delimeter)[1:]
All I get is the error message "error: multiple repeat".
The IDs look like this:
YAL001C
YGR103W
YKL068W-A
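The first character is always ">", followed by uppercase letters and digits, and sometimes a dash (-). I want an RE that finds all of these occurrences and lets me split the text, using the RE as the delimiter, to get the IDs and sequences and omit the metadata. I'm new to regular expressions, so I know very little about the topic!
Note: there is exactly one newline between each of the three fields (ID, metadata, sequence).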
Answer 0 (score: 0)
Try
>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)
You will find the ID in the group id and the sequence in the group sequence.
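As a minimal sketch of reading those named groups, the pattern can be run with `re.finditer`. The sample text and the names `sample`, `pattern` and `records` are made up for illustration, and the sequence is assumed to contain only word characters (possibly wrapped over several lines).

```python
import re

# Hypothetical sample in the three-line layout described in the question.
sample = """>YAL001C
some <i>metadata</i> with > characters
KRHDEKRHDE
>YKL068W-A
more metadata
MKVLAA
"""

pattern = re.compile(r">(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)")

records = {}
for m in pattern.finditer(sample):
    # [\w\n]+ also picks up newlines inside and after the sequence; drop them.
    records[m.group("id")] = m.group("sequence").replace("\n", "")

print(records)
```

The stray ">" characters in the metadata do no harm here, because each match consumes the whole metadata line before the search for the next ID resumes.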
Explanation:
>                 # start with a ">" character
(?P<id>           # capture the ID in group "id"
  [\w-]+          # one or more word characters (A to Z, a to z, digits, and _) or dashes "-"
)
\s                # after the ID, there must be a whitespace character (here, the newline)
.*                # consume the metadata part, we have no interest in this
\n                # up to a newline
(?P<sequence>     # finally, capture the sequence data in group "sequence"
  [\w\n]+         # one or more word characters and newlines
)
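A commented breakdown like this can be kept in real code with the `re.VERBOSE` flag, which ignores whitespace and `#` comments inside the pattern. The sketch below uses the single `\s` from the one-line pattern; the sample record and the variable names are invented for illustration.

```python
import re

# Same pattern as the one-liner, spread out with comments via re.VERBOSE.
verbose = re.compile(r"""
    >                   # literal ">" that starts an ID line
    (?P<id>[\w-]+)      # the ID: word characters or dashes
    \s                  # the whitespace (newline) after the ID
    .*                  # the metadata line, ignored
    \n                  # end of the metadata line
    (?P<sequence>       # the sequence:
        [\w\n]+         #   word characters and newlines
    )
""", re.VERBOSE)

compact = re.compile(r">(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)")

sample = ">XYZ1234\n<><><><>><<<>\nLMNOP"  # made-up record
m = verbose.search(sample)
print(m.group("id"), m.group("sequence"))

# Both spellings should extract the same groups from the sample.
same = m.groupdict() == compact.search(sample).groupdict()
```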
As Python code:
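import re

text = '''>YKL068W-A
foo
ABCD
>XYZ1234
<><><><>><<<>
LMNOP'''

# Raw string so that \w and \n reach the regex engine unchanged
pattern = r'>(?P<id>[\w-]+)\n.*\n(?P<sequence>\w+)'
# findall returns one (id, sequence) tuple per record
for seq_id, sequence in re.findall(pattern, text):
    print((seq_id, sequence))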
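Since the question asks about splitting on the ID lines, here is a hedged sketch using `re.split`, which accepts a pattern where `str.split` only takes a literal string. The "multiple repeat" error most likely comes from `{1}+` and `{7,10}+`, which stack two quantifiers on one item. Older versions of `re` reject that, while Python 3.11 and later read the trailing `+` as a possessive quantifier. The delimiter below is an assumption based on the three sample IDs, it also assumes no metadata line starts with ">", and the names `ID_LINE` and `parse_records` are hypothetical.

```python
import re

# Assumed delimiter: ">" at the start of a line, then an ID made of
# uppercase letters, digits and dashes, then the end of that line.
ID_LINE = re.compile(r"^>([A-Z0-9-]+)[ \t]*\n", re.MULTILINE)

def parse_records(text):
    """Return {id: sequence}, dropping the metadata line of each record."""
    # With a capturing group, re.split keeps the IDs in the result:
    # ['', id1, body1, id2, body2, ...]
    parts = ID_LINE.split(text)
    records = {}
    for seq_id, body in zip(parts[1::2], parts[2::2]):
        # body is the metadata line, a newline, then the sequence.
        _metadata, _, sequence = body.partition("\n")
        records[seq_id] = sequence.replace("\n", "")
    return records

sample = """>YKL068W-A
foo <sub>bar</sub> > baz
ABCD
>XYZ1234
<><><><>><<<>
LMNOP
"""
print(parse_records(sample))
```

For a real file, the same function could be fed the result of reading `composition.in` in text mode.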