How do I split a string on its item numbers?

Time: 2018-07-14 16:24:10

Tags: python regex string nltk

I want to split the following corpus into parts:

corpus = '1  Write short notes on the anatomy of the Circle of Willis including normal variants.     2  Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.      3  Write short notes on the anatomy of the axis (C2 vertebra).      4  Write short notes on the anatomy of the corpus callosum.      5  Write short notes on the anatomy of the posterior division of the internal iliac artery  6  Write short notes on the anal canal including sphincters.'

into the following:

['Write short notes on the anatomy of the Circle of Willis including normal variants.', 'Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.', 'Write short notes on the anatomy of the axis (C2 vertebra).', 'Write short notes on the anatomy of the posterior division of the internal iliac artery', 'Write short notes on the anal canal including sphincters.']

I wrote this, but it doesn't work:

for i in [int(s) for s in corpus.split() if s.isdigit()]:
    answer = corpus.split(str(i))

print(answer)

What should I do?

4 Answers:

Answer 0 (score: 3)

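For your example data, you could also split on zero or more spaces, followed by one or more digits and exactly two spaces (the pattern is quoted here so that its leading space and two trailing spaces stay visible):

' *\d+  '

import re

# filter(None, ...) drops the empty string that re.split yields before the first number;
# list(...) is needed on Python 3, where filter returns an iterator
print(list(filter(None, re.split(r' *\d+  ', corpus))))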


For clarity, you could put the space in a character class followed by a quantifier: [ ]*\d+[ ]{2}
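As a quick check that the two spellings behave the same, here is a minimal sketch. The shortened `sample` string is made up for illustration and only mimics the spacing of the corpus in the question.

```python
import re

# Hypothetical shortened sample that mimics the spacing of the question's corpus
sample = '1  First note.     2  Second note (C2 vertebra).  3  Third note.'

# Literal spaces in the pattern
plain = [p for p in re.split(r' *\d+  ', sample) if p]

# Same pattern with the spaces made explicit in character classes
bracketed = [p for p in re.split(r'[ ]*\d+[ ]{2}', sample) if p]

print(plain)               # ['First note.', 'Second note (C2 vertebra).', 'Third note.']
print(plain == bracketed)  # True
```

The "2" in "C2" is left alone because it is followed by only one space.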

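Answer 1 (score: 0)

You tagged the question regex, but your own attempt does not use a regular expression. Here is a working non-regex solution for your original post.

You can split on whitespace, then accumulate the text tokens in a temporary variable until you reach the next number, and at that point add the accumulated part to the overall result.

Because strings are immutable, collecting the temporary tokens in a list is more efficient than repeatedly appending to a string.

The numbers themselves are skipped rather than stored: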

corpus = '1  Write short notes on the anatomy of the Circle of Willis including normal variants.     2  Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.      3  Write short notes on the anatomy of the axis (C2 vertebra).      4  Write short notes on the anatomy of the corpus callosum.      5  Write short notes on the anatomy of the posterior division of the internal iliac artery  6  Write short notes on the anal canal including sphincters.'               

allparts = []  # total result
part = []      # parts that belong to one number
for p in corpus.split():
    if p.isdigit():      # if a number
        if part:             # if stored something
            allparts.append(' '.join(part))   # add it to result
            part=[]
        continue         # skip storing the number  

    part.append(p)      # add to part

if part:   # add rest
    allparts.append(' '.join(part))

print(allparts)

Output:

['Write short notes on the anatomy of the Circle of Willis including normal variants.', 
 'Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.', 
 'Write short notes on the anatomy of the axis (C2 vertebra).', 
 'Write short notes on the anatomy of the corpus callosum.', 
 'Write short notes on the anatomy of the posterior division of the internal iliac artery', 
 'Write short notes on the anal canal including sphincters.']
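The same accumulate-until-the-next-number idea can also be sketched with `itertools.groupby`, which groups consecutive tokens by whether they are numbers. This is only an alternative sketch, not part of the original answer; the function name `split_on_number_tokens` and the shortened `sample` string are made up for illustration.

```python
from itertools import groupby


def split_on_number_tokens(text):
    """Join each run of non-numeric tokens found between number tokens."""
    tokens = text.split()
    return [
        ' '.join(run)
        for is_number, run in groupby(tokens, key=str.isdigit)
        if not is_number  # drop the runs that consist of numbers
    ]


# Hypothetical shortened sample that mimics the spacing of the question's corpus
sample = '1  First note.     2  Second note (C2 vertebra).  3  Third note.'
parts = split_on_number_tokens(sample)
print(parts)  # ['First note.', 'Second note (C2 vertebra).', 'Third note.']
```

Like the loop above, this treats every standalone number token as a separator, so a bare number inside a sentence would also split the text.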

Answer 2 (score: 0)

Use re.split with a list comprehension, and str.strip to remove any leading or trailing whitespace:

import re
result = [
    phrase for phrase in map(str.strip, re.split(r'\d+\s\s', corpus)) if phrase
]

Result:

['Write short notes on the anatomy of the Circle of Willis including normal variants.',
 'Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.',
 'Write short notes on the anatomy of the axis (C2 vertebra).',
 'Write short notes on the anatomy of the corpus callosum.',
 'Write short notes on the anatomy of the posterior division of the internal iliac artery',
 'Write short notes on the anal canal including sphincters.']
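The two whitespace characters after the digits matter for this corpus, because "(C2 vertebra)" contains a digit followed by a single space. The sketch below compares the pattern above with a looser one. The helper `split_numbered`, the shortened `sample`, and the third pattern `(?<!\S)\d+\s+` are not from the answer; the third is one possible way to match only numbers that stand alone, in case the spacing is not always exactly two characters.

```python
import re

# Two items from the question's corpus, spacing preserved
sample = ('3  Write short notes on the anatomy of the axis (C2 vertebra).      '
          '4  Write short notes on the anatomy of the corpus callosum.')


def split_numbered(text, pattern):
    """Split on pattern, strip each piece, and drop empty pieces."""
    return [part for part in map(str.strip, re.split(pattern, text)) if part]


strict = split_numbered(sample, r'\d+\s\s')            # pattern from the answer
loose = split_numbered(sample, r'\d+\s')               # one whitespace: also splits at "C2 "
standalone = split_numbered(sample, r'(?<!\S)\d+\s+')  # assumption: number not preceded by a non-space

print(len(strict), len(loose), len(standalone))  # 2 3 2
```

The lookbehind `(?<!\S)` only accepts a number at the start of the text or right after whitespace, so the "2" in "C2" is not treated as an item number.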

Answer 3 (score: -1)

Try combining re.split() with a regular expression and strip():

import re

a = "1  hello.  2  my name is. 3  maat."

# split on each number, drop the empty leading piece, then trim the spaces
answer = [s.strip(" ") for s in filter(None, re.split(r" *\d+ ", a))]

print(answer)  # ['hello.', 'my name is.', 'maat.']

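re.split() works much like str.split(), except that the delimiter is a regular expression; strip(" ") then removes the leading and trailing spaces from each piece s.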
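One more difference from str.split() is that re.split() also returns the delimiter text when the pattern contains a capturing group. The sketch below uses that to keep each number next to its text. This goes beyond what the question asked for, and the names `pieces` and `numbered` are made up for illustration.

```python
import re

a = "1  hello.  2  my name is. 3  maat."

# The parentheses around \d+ make re.split return the matched numbers as well
pieces = re.split(r" *(\d+) ", a)
print(pieces)  # ['', '1', ' hello.', '2', ' my name is.', '3', ' maat.']

# Pair each captured number with the text that follows it
numbered = {int(n): text.strip() for n, text in zip(pieces[1::2], pieces[2::2])}
print(numbered)  # {1: 'hello.', 2: 'my name is.', 3: 'maat.'}
```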