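I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text that follows the identifier (but comes before the next > symbol). The identifier is always on its own line, but the text can span multiple lines.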
>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
After running it, my data structure might look like this:
id = ['identifier1', 'identifier2', 'identifier3']
and
txt =
['lalalalalalalalalalalalalalalalala',
'bababababababababababababababababa',
'wawawawawawawawawawawawawawawawawa']
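It seems I want to use a regex to find (1) whatever comes after a > but before the line break, and (2) whatever lies between the > signs, having temporarily removed the identifier string and the EOL and replaced them with "".
The catch is that I will have hundreds of these identifiers, so I need the regex to run over them in order. Any ideas on how to approach this? I'm working in Python, but feel free to use whatever language you want in your reply.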
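The regex idea described above could be sketched roughly as follows. This is only an illustration, not a tested solution for real data: it assumes that > never appears inside the text itself, and the names sample and pattern are made up for the example.

```python
import re

sample = """>identifier1
lalala
lalala
>identifier2
bababa
>identifier3
wawawa"""

# Group 1: the rest of a line that starts with '>' (the identifier).
# Group 2: everything after that line up to the next '>' (assumption:
# the text itself never contains a '>' character).
pattern = re.compile(r'^>(.*)\n([^>]*)', re.MULTILINE)

ids = []
txt = []
for match in pattern.finditer(sample):
    ids.append(match.group(1).strip())
    # Remove line breaks so multi-line text becomes a single string.
    txt.append(match.group(2).replace('\n', ''))
```

Because finditer scans the string from left to right, ids[i] and txt[i] stay paired in input order.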
*Update 1: The code from slater is getting closer, but things still aren't being split in order as id, text, id, text, etc.*
teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
But the output is:
['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3
Note: it needs to work on multi-line strings and handle all the \n's. A better test case might be:
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''
# First, split the text into relevant chunks
split_text = teststring.split('>')
#see where we are after split
print split_text
#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')
#see where we are after removing '', before partitioning
print split_text
id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]
#see where we are after partition
print id
print txt
print len(split_text)
print len(id)
Current output:
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
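A likely reason the partition step leaves every chunk unchanged in the output above: r'\n' is a raw string, that is, a backslash followed by the letter n, not a newline character. A small sketch of the difference (the variable names are chosen only for this illustration):

```python
chunk = 'identifier1\nlalala\n'

# r'\n' is a raw string: a backslash followed by 'n', two characters long.
raw_sep = r'\n'
# '\n' is a single newline character.
newline = '\n'

# The raw separator never occurs in the chunk, so partition returns the
# whole chunk as the head, with an empty separator and an empty tail.
head_raw, sep_raw, tail_raw = chunk.partition(raw_sep)

# With a real newline, the head is the identifier and the tail is the text.
head, sep, tail = chunk.partition(newline)
```

Note also that index [0] of the partition result is the part before the separator and index [2] is the part after it, so using [0] for both lists would produce two lists of identifiers even with the correct separator.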
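Answer 0 (score: 1)
Personally, I think you should use regular expressions as little as possible. They tend to be slow, hard to maintain, and generally hard to read.
That said, solving this in Python is quite simple. I'm a little unclear on exactly what you mean by running "in order", but let me know if this solution fits your needs.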
# First, split the text into relevant chunks
split_text = text.split('>')
id = [text.partition('\n')[0] for text in split_text]
txt = [text.partition('\n')[2] for text in split_text]
Obviously you could make the code more efficient, but if you are only dealing with a few hundred identifiers, there is really no need.
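For the multi-line test case from the question, the split-and-partition approach might be fleshed out like this. It is a sketch under a few assumptions: the piece before the first > is empty or whitespace only and can be skipped, and the text lines should be joined without a separator. The shortened teststring is just sample data.

```python
teststring = '''
>identifier1
lalala
lalala
>identifier2
bababa
>identifier3
wawawa'''

ids = []
txt = []
for chunk in teststring.split('>'):
    # Skip the empty or whitespace-only piece before the first '>'.
    if not chunk.strip():
        continue
    # '\n' (not r'\n') splits the identifier line from the rest.
    header, _, body = chunk.partition('\n')
    ids.append(header.strip())
    # Join the remaining lines into one string without line breaks.
    txt.append(''.join(body.splitlines()))
```

Building both lists in the same loop keeps each identifier aligned with its text.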
If you want to remove any blank entries that may show up, you can do the following:
list_with_blanks = ['', 'hello', '', '', 'world']
filter(None, list_with_blanks)
>>> ['hello', 'world']
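Two caveats are worth a short sketch. In Python 3, filter returns an iterator rather than a list, and filter(None, ...) only drops empty strings, so an entry such as '\n' (which appears in the question's second test case) survives. The list below is sample data for illustration only.

```python
list_with_blanks = ['', 'hello', '\n', '', 'world']

# Python 3: filter returns an iterator, so wrap it in list().
no_empty = list(filter(None, list_with_blanks))

# '\n' is a non-empty string, so filter(None, ...) keeps it;
# testing the stripped value also drops whitespace-only entries.
no_blank = [item for item in list_with_blanks if item.strip()]
```

Here no_empty still contains the '\n' entry, while no_blank does not.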
Let me know if you have any other questions.
Answer 1 (score: 1)
Unless I have misunderstood the question, it is as simple as:
id, text = [], []
for line in your_file:
    if line.startswith('>'):
        id.append(line[1:].strip())
    else:
        text.append(line.strip())
Edit: to join multiple lines:
ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line
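If the goal is to keep each identifier paired with its text (id, text, id, text, ...), the same loop could be wrapped in a small helper that returns pairs. The function name parse_records is hypothetical, and the sketch assumes its input is any iterable of lines, such as the result of splitlines() or an open file.

```python
def parse_records(lines):
    """Return a list of (identifier, text) tuples in input order."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # Start a new record with an empty text.
            records.append([line[1:].strip(), ''])
        elif records:
            # Append this line to the text of the most recent record.
            records[-1][1] += line
    return [tuple(record) for record in records]


teststring = '''
>identifier1
lalala
lalala
>identifier2
bababa
>identifier3
wawawa'''

records = parse_records(teststring.splitlines())
```

Lines that appear before the first identifier are ignored, because records is still empty at that point.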
Answer 2 (score: 0)
teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''
i = 0
j = 0
#split the multiline string by line
dsplit = teststring.split('\n')
#the indicies of identifiers
index = list()
for line in dsplit:
    if line.startswith('>'):
        print line
        index.append(i)
        j = j + 1
    i = i+1
index.append(i) #add this so you get the last block of text
#the text corresponding to each index
thetext = list()
#the names corresponding to each gene
thenames = list()
for n in range(0, len(index)-1):
    thetext.append("")
    for k in range(index[n]+1, index[n+1]):
        thetext[n] = thetext[n] + dsplit[k]
    thenames.append(dsplit[index[n]][1:]) # the [1:] removes the first character (>) from the line
print "the indicies", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j
This gives the following output:
>identifier1
>identifier2
>identifier3
the indicies [1, 6, 11, 16]
the text: ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries: 3
this many index entries: 3
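The same index-based idea can be written more compactly with enumerate and slicing. This is a sketch of an equivalent approach rather than a replacement for the code above; the shortened teststring and the names names and texts are chosen for the example.

```python
teststring = '''
>identifier1
lalala
lalala
>identifier2
bababa
>identifier3
wawawa'''

lines = teststring.split('\n')

# Positions of the identifier lines, plus one past the end so the
# last block of text has a closing boundary.
index = [i for i, line in enumerate(lines) if line.startswith('>')]
index.append(len(lines))

# Each identifier is its line without the leading '>'.
names = [lines[start][1:] for start in index[:-1]]
# Each text is the concatenation of the lines between two boundaries.
texts = [''.join(lines[start + 1:end])
         for start, end in zip(index, index[1:])]
```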