My text looks like this:
TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
TTL1 | TTL2 |
TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
some text in a line5
TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
...
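Explanation: I have header lines, which are sometimes split across several lines, followed by any number of other lines. I want to capture all the titles (even when they sit on different lines) and also capture all the lines that follow a header in a single group.
I'm dealing with multi-line headers and multi-line content, and I don't know how to extract them with regular expressions in Python.
Any ideas?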
Answer 0 (score: 1)
You can try this:
\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\n([^\|]*)(?:\n|$)
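According to the OP's comment, the content lines can, oddly enough, also contain |, which makes it hard to tell headers from content lines. In that case you can try the following pattern instead: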
^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*?)(?=^\s*\w+\s*\n*\|\s*\n*\w+\s*\n*\|\s*\n*\w+\s*\n*)|^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*)$
Sample code:
import re
regex = r"\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\n([^\|]*)(?:\n|$)"
test_str = ("TTL1 | TTL2 | TTL3\n"
"some text in a line1\n"
"some text in a line2\n"
"some text in a line3\n"
"TTL1 | TTL2 | \n"
"TTL3\n"
"some text in a line1\n"
"some text in a line2\n"
"some text in a line3\n"
"some text in a line4\n"
"some text in a line5\n"
"TTL1 | TTL2 | TTL3\n"
"some text in a line1\n"
"some text in a line2\n"
"some text in a line3\n"
"some text in a line4")
matches = re.finditer(regex, test_str, re.DOTALL)
for matchNum, match in enumerate(matches):
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
    print(match.group(4))
Sample output:
TTL1
TTL2
TTL3
some text in a line1
some text in a line2
some text in a line3
TTL1
TTL2
TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
some text in a line5
TTL1
TTL2
TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
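The sample code above only exercises the first pattern. Below is a minimal sketch of how the second pattern (the one meant for content lines that contain |) might be used. The flags `re.MULTILINE | re.DOTALL`, the helper name `split_blocks`, and the sample text are assumptions made for illustration; the answer itself does not say which flags it relies on.

```python
import re

# Second pattern from this answer, copied verbatim. The flags are an assumption:
# MULTILINE lets ^ and $ match at line boundaries, DOTALL lets . cross newlines.
HEADER_BLOCK = re.compile(
    r"^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*?)"
    r"(?=^\s*\w+\s*\n*\|\s*\n*\w+\s*\n*\|\s*\n*\w+\s*\n*)"
    r"|^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*)$",
    re.MULTILINE | re.DOTALL,
)


def split_blocks(text):
    """Return (titles, body) pairs; the body may itself contain '|' characters."""
    blocks = []
    for m in HEADER_BLOCK.finditer(text):
        # Groups 1-4 belong to the first alternative, groups 5-8 to the second
        # (used for the final block, which has no header after it).
        offset = 0 if m.group(1) is not None else 4
        titles = tuple(m.group(offset + i) for i in (1, 2, 3))
        body = m.group(offset + 4).strip("\n")
        blocks.append((titles, body))
    return blocks


# Hypothetical sample with '|' inside the content lines.
sample = (
    "TTL1 | TTL2 | TTL3\n"
    "some text | with a pipe\n"
    "more text\n"
    "TTL1 | TTL2 |\n"
    "TTL3\n"
    "last block line1\n"
    "last | block line2"
)

for titles, body in split_blocks(sample):
    print(titles)
    print(body)
```

Note that a content line shaped exactly like `word | word | word` would still be taken for a header, since that shape is all the lookahead checks for.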
Answer 1 (score: 0)
Use the following approach with the re.findall() function:
import re

# lines.txt is a file containing the initial text from your question
with open('lines.txt', 'r') as fh:
    t = fh.read()

items = re.findall(r'([A-Z\d\s|]+)([^A-Z]+)', t)
# 'h' contains header, 'lines' contains the lines related to current header
for h, lines in items:
    print(h.replace('\n', ' '), lines, sep='\n')
Output:
TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
some text in a line5
TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
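If you need the individual titles and content lines rather than two raw strings, the same pattern can be wrapped in a small helper. This is only a sketch: the function name `parse_blocks` and the inline sample text are made up for illustration. It inherits the pattern's assumptions, namely that headers consist only of uppercase letters, digits, whitespace and |, and that content lines contain no uppercase letters.

```python
import re


def parse_blocks(text):
    """Return a list of (titles, lines) pairs using the pattern from this answer."""
    blocks = []
    for header, body in re.findall(r'([A-Z\d\s|]+)([^A-Z]+)', text):
        # A header may span several lines; splitting on '|' and stripping
        # whitespace (including newlines) recovers the individual titles.
        titles = [part.strip() for part in header.split('|') if part.strip()]
        # Keep only the non-empty content lines that follow this header.
        lines = [line for line in body.splitlines() if line.strip()]
        blocks.append((titles, lines))
    return blocks


# Hypothetical sample; the first header is split across two lines.
text = (
    "TTL1 | TTL2 |\n"
    "TTL3\n"
    "some text in a line1\n"
    "some text in a line2\n"
    "TTL1 | TTL2 | TTL3\n"
    "some text in a line3"
)

for titles, lines in parse_blocks(text):
    print(titles, lines)
```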