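How do I parse a sequentially organized multi-line string into a data structure using regex/Python?

Time: 2014-02-07 08:32:32

Tags: python regex parsing

I need to parse a multi-line string into a data structure containing (1) the identifier and (2) the text that follows the identifier (but comes before the next > symbol). The identifier is always on its own line, but the text can span multiple lines.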

>identifier1
lalalalalalalalalalalalalalalalala
>identifier2 
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa

After running it, my data structure might look like this:

id = ['identifier1', 'identifier2', 'identifier3']

txt = 
['lalalalalalalalalalalalalalalalala',
 'bababababababababababababababababa', 
 'wawawawawawawawawawawawawawawawawa']

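It seems I want to use a regex to find (1) whatever comes after a > but before the carriage return, and (2) whatever lies between the > symbols, having temporarily removed the identifier string and the EOL and replaced them with "".

The catch is that I will have hundreds of these identifiers, so I need to run the regex sequentially. Any ideas on how to attack this problem? I am working in Python, but feel free to use whatever language you like in your response.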
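
A rough sketch of the regex idea described above, written for Python 3. It assumes that > never appears inside the text itself and that every identifier line is followed by a newline; the sample data and the names record_re, ids and txt are illustrative only.

```python
import re

sample = """>identifier1
lalala
lalala
>identifier2
bababa
>identifier3
wawawa"""

# One match per record: the identifier line, then everything up to the
# next '>' (or the end of the string). Assumes '>' never occurs in the text.
record_re = re.compile(r'^>(.*)\n([^>]*)', re.MULTILINE)

ids, txt = [], []
for match in record_re.finditer(sample):
    ids.append(match.group(1).strip())
    # Drop the line breaks so multi-line text becomes one string.
    txt.append(''.join(match.group(2).splitlines()))
```

Because finditer walks the string from left to right, the two lists stay in the same order as the records in the input.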

*Update 1: the code from slater gets closer, but things are still not being split sequentially into id, text, id, text, and so on.*

teststring = '''>identifier1
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa'''

# First, split the text into relevant chunks
split_text = teststring.split('>')

#see where we are after split
print split_text

#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')

#see where we are after removing '', before partitioning
print split_text

id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]

#see where we are after partition
print id
print txt
print len(split_text)
print len(id)

But the output is:

['', 'identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
['identifier1\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa']
3
3

Note: it needs to work on multi-line strings and handle all the \n characters. A better test case might be:

teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala
lalalalalalalalalalalalalalalalala
>identifier2
bababababababababababababababababa
bababababababababababababababababa
>identifier3
wawawawawawawawawawawawawawawawawa
wawawawawawawawawawawawawawawawawa'''

# First, split the text into relevant chunks
split_text = teststring.split('>')

#see where we are after split
print split_text

#remove spaces that will mess up the partitioning
while '' in split_text:
    split_text.remove('')

#see where we are after removing '', before partitioning
print split_text

id = [text.partition(r'\n')[0] for text in split_text]
txt = [text.partition(r'\n')[0] for text in split_text]

#see where we are after partition
print id
print txt
print len(split_text)
print len(id)

Current output:

['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
['\n', 'identifier1\nlalalalalalalalalalalalalalalalala\nlalalalalalalalalalalalalalalalala\n', 'identifier2\nbababababababababababababababababa\nbababababababababababababababababa\n', 'identifier3\nwawawawawawawawawawawawawawawawawa\nwawawawawawawawawawawawawawawawawa']
4
4
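
A likely reason the id and txt lists above come back identical to split_text: r'\n' is a raw string (a backslash followed by the letter n), so partition never finds it in text that contains real newlines, and both lists also take index [0]. A small sketch of the difference, using a made-up chunk:

```python
chunk = 'identifier1\nlalala\n'

# r'\n' is two characters (backslash + 'n'); '\n' is one newline character.
raw_sep_length = len(r'\n')
newline_length = len('\n')

# The raw separator is not present in the chunk, so partition returns the
# whole chunk as the head and empty strings for the rest.
head_raw, sep_raw, tail_raw = chunk.partition(r'\n')

# With a plain '\n' the chunk is split at the first line break:
# index [0] is the identifier and index [2] is the remaining text.
head, sep, tail = chunk.partition('\n')
```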

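3 Answers:

Answer 0 (score: 1)

Personally, I feel you should use regex as little as possible. It is slow, hard to maintain, and generally hard to read.

That said, solving this in Python is very straightforward. I am a little unclear on what exactly you mean by running it "sequentially", but let me know whether this solution fits your needs.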

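# First, split the text into relevant chunks
split_text = text.split('>')
# Everything before the first newline of a chunk is the identifier
id = [chunk.partition('\n')[0] for chunk in split_text]
# Everything after that newline is the associated text
txt = [chunk.partition('\n')[2] for chunk in split_text]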

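Obviously you could make the code more efficient, but if you are only dealing with hundreds of identifiers there is really no need.

If you want to remove any blank entries that may show up, you can do the following: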
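
A self-contained sketch of the same split/partition idea, written for Python 3, that also skips the blank leading chunk and joins multi-line text. The sample string and the names ids, header and body are illustrative, not part of the original answer.

```python
teststring = '''
>identifier1
lalala
lalala
>identifier2
bababa
bababa
>identifier3
wawawa
wawawa'''

ids, txt = [], []
for chunk in teststring.split('>'):
    # Skip the empty or whitespace-only piece before the first '>'.
    if not chunk.strip():
        continue
    header, _, body = chunk.partition('\n')
    ids.append(header.strip())
    # Join the remaining lines into a single string.
    txt.append(''.join(body.splitlines()))
```

Both lists are filled in the same loop, so ids[i] always belongs to txt[i].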

>>> list_with_blanks = ['', 'hello', '', '', 'world']
>>> filter(None, list_with_blanks)
['hello', 'world']
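
Note that the call above returns a list only in Python 2. A minimal Python 3 sketch of the same filtering (the names non_blank and stripped are illustrative):

```python
list_with_blanks = ['', 'hello', '', '', 'world']

# In Python 3, filter() returns a lazy iterator, so wrap it in list().
non_blank = list(filter(None, list_with_blanks))

# A list comprehension with strip() also drops whitespace-only entries
# such as '\n', which filter(None, ...) would keep.
stripped = [s for s in ['\n', 'hello', ' ', 'world'] if s.strip()]
```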

Let me know if you have any further questions.

Answer 1 (score: 1)

Unless I have misunderstood the question, it is as simple as:

ids, text = [], []
for line in your_file:
    if line.startswith('>'):
        ids.append(line[1:].strip())
    else:
        text.append(line.strip())

Edit: to join multiple lines:

ids, text = [], []
for line in teststring.splitlines():
    if line.startswith('>'):
        ids.append(line[1:])
        text.append('')
    elif text:
        text[-1] += line

Answer 2 (score: 0)

I found a solution. It is certainly not very Pythonic, but it works.

teststring = '''
>identifier1
lalalalalalalalalalalalalalalalala\n
lalalalalalalalalalalalalalalalala\n
>identifier2
bababababababababababababababababa\n
bababababababababababababababababa\n
>identifier3
wawawawawawawawawawawawawawawawawa\n
wawawawawawawawawawawawawawawawawa\n'''

i = 0
j = 0

#split the multiline string by line
dsplit = teststring.split('\n')

#the indices of identifiers
index = list()

for line in dsplit:
    if line.startswith('>'):
        print line
        index.append(i)
        j = j + 1
    i = i+1
index.append(i)  #add this so you get the last block of text

#the text corresponding to each index
thetext = list()
#the names corresponding to each gene
thenames = list()
for n in range(0, len(index)-1):
    thetext.append("")
    for k in range(index[n]+1, index[n+1]):
        thetext[n] = thetext[n] + dsplit[k]
    thenames.append(dsplit[index[n]][1:]) # the [1:] removes the first character (>) from the line
print "the indicies", index
print "the text: ", thetext
print "the names", thenames
print "this many text entries: ", len(thetext)
print "this many index entries: ", j

This gives the following output:

>identifier1
>identifier2
>identifier3
the indicies [1, 6, 11, 16]
the text:  ['lalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalalala', 'babababababababababababababababababababababababababababababababababa', 'wawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawa']
the names ['identifier1', 'identifier2', 'identifier3']
this many text entries:  3
this many index entries:  3
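
The same index-based approach can probably be written more compactly. The following Python 3 sketch keeps the names thenames and thetext from the code above; starts, ends and the shortened sample data are illustrative.

```python
teststring = '''
>identifier1
lalala
lalala
>identifier2
bababa
bababa
>identifier3
wawawa
wawawa'''

lines = teststring.split('\n')

# Positions of the identifier lines.
starts = [i for i, line in enumerate(lines) if line.startswith('>')]
# Each block ends where the next identifier starts; the last one ends at EOF.
ends = starts[1:] + [len(lines)]

# [1:] removes the leading '>' from each identifier line.
thenames = [lines[start][1:] for start in starts]
# Join the lines between one identifier and the next.
thetext = [''.join(lines[start + 1:end]) for start, end in zip(starts, ends)]
```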