I'm trying to build a regular expression that splits a very long text string into its components. Here is an example:
12345 2018-01-03 15:24:12 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**
So far, I have something like this:
import re
example = r"12345 2018-01-03 6789 STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**"
str_regex = r"(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
r"(?P<firstpart>START.*\*{2})"
m = re.search(str_regex, example)
m.group('id1')
## '12345'
# This is OK
m.group('adate')
## '2018-01-03'
# This is OK
m.group('id2')
## '6789'
# This is OK
m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**STARTconsectetur adipiscing elit**'
# This is where I'm lost: I'd like to match the text until the first occurrence of '**'
When I try to add a second part to the regular expression, it stops working:
str_regex = "(?P<id1>\d+) (?P<adate>\d{4}-\d{2}-\d{2}) (?P<id2>\d+) " + \
"(?P<firstpart>START.*\*{2})" + \
"(?P<secondpart>START.*\*{2}"
m = re.search(str_regex, example) # This doesn't match anything anymore!
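One possible cause, assuming the snippet above is copied exactly as it was run: the `secondpart` group is never closed, so the `re` module rejects the pattern with an error instead of returning a non-match. The sketch below reproduces only that behaviour; the names `broken` and `error_message` are illustrative and not part of the original code.

```python
import re

# Same two content groups as in the attempt above; the final ")" is missing.
broken = (r"(?P<firstpart>START.*\*{2})"
          r"(?P<secondpart>START.*\*{2}")

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # An unbalanced group makes the pattern invalid, so compilation fails.
    error_message = str(exc)

print(error_message)
```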
What I'd like to end up with is a regular expression that splits the example string as follows:
str_regex = "..." # What to put here to split the string?
m = re.search(str_regex, example)
m.group('id1')
## '12345'
m.group('adate')
## '2018-01-03'
m.group('id2')
## '6789'
m.group('firstpart')
## 'STARTlorem ipsum dolor sit amet**'
m.group('secondpart')
## 'STARTconsectetur adipiscing elit**'
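Notes:
* The parts I need are the three fields id1, date and id2, plus the two content parts (firstpart and secondpart). Anything else can be (and will be) ignored.
Can you point me in the right direction?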
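One direction that may help is to make the quantifier lazy (`.*?` instead of `.*`), so each content part stops at the first `**` rather than the last. The following is a minimal sketch, not a definitive solution. It assumes each part starts with `START` and ends with `**`, and it treats the time shown in the first sample (`15:24:12`) as an optional field that is skipped. The names `pattern`, `split_record`, `with_time` and `without_time` are made up for illustration.

```python
import re

# Lazy quantifiers (.*?) stop at the first "**" instead of the last one.
# The optional non-capturing group skips a time such as " 15:24:12" if present
# (an assumption based on the first sample string).
pattern = re.compile(
    r"(?P<id1>\d+) "
    r"(?P<adate>\d{4}-\d{2}-\d{2})(?: \d{2}:\d{2}:\d{2})? "
    r"(?P<id2>\d+) "
    r"(?P<firstpart>START.*?\*{2})"
    r"(?P<secondpart>START.*?\*{2})"
)


def split_record(text):
    """Return the named groups as a dict, or None if the text does not match."""
    m = pattern.search(text)
    return m.groupdict() if m else None


without_time = ("12345 2018-01-03 6789 "
                "STARTlorem ipsum dolor sit amet**"
                "STARTconsectetur adipiscing elit**")
with_time = ("12345 2018-01-03 15:24:12 6789 "
             "STARTlorem ipsum dolor sit amet**"
             "STARTconsectetur adipiscing elit**")

print(split_record(without_time))
print(split_record(with_time))
```

If the content parts can span several lines, passing `re.DOTALL` to `re.compile` lets `.` match newlines as well. Whether that is needed depends on the real data.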