RegEx to capture a multi-line text body

Time: 2013-10-17 13:55:37

Tags: python regex

So I have some text files that look like this:

1a  Title
        Subtitle
            Description
1b  Title
        Subtitle A
            Description
        Subtitle B
            Description
2   Title
        Subtitle A
            Description
        Subtitle B
            Description
        Subtitle C
            Description

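I am trying to use regex to capture the "Description" lines, which are indented with 3 tabs. The problem I am having is that a description line sometimes wraps onto the next line, which is again indented with 3 tabs. Here is an example: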

1   Demo
        Example
            This is the description text body that I am
            trying to capture with regex.

I would like to capture this text in a single group, ending up with:

This is the description text body that I am trying to capture with regex.

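Once I am able to do that, I would also like to 'flatten' the document so that each section sits on a single line, delimited by characters instead of line breaks and tabs. So my example would become: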

1->Demo->->Example->->->This is the description text...

I will be implementing this in Python, but any regex guidance would be greatly appreciated!


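UPDATE
I changed the delimiters in the flattened text to indicate the relationship to what precedes them, i.e. 1 tab ->, 2 tabs ->->, 3 tabs ->->->, and so on.

Also, here is what the flattened text should look like when a title (section) has multiple subtitles (subsections):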

  

1a->Title->->Subtitle->->->Description
1b->Title->->Subtitle A->->->Description
1b->Title->->Subtitle B->->->Description
2->Title->->Subtitle A->->->Description
2->Title->->Subtitle B->->->Description
2->Title->->Subtitle C->->->Description

Basically, the parent (number/title) is simply 're-used' for each child (subtitle).

3 Answers:

Answer 0 (score: 2)

You can do this without regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
\t\tSep
\t\t\tAnd Another Section
\t\t\tOn two lines
'''

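cap=[]   # finished descriptions, one string each
buf=[]   # lines of the description currently being collected
for line in txt.splitlines():
    if line.startswith('\t\t\t'):
        buf.append(line.strip())
        continue
    # any other line ends the current description
    if buf:
        cap.append(' '.join(buf))
        buf=[]
# flush a description that runs to the end of the text
if buf:
    cap.append(' '.join(buf))

print(cap)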

Prints:

['This is the description text body that I am trying to capture with regex.', 
 'And Another Section On two lines']

The advantage is that separate sections, each indented with 3 tabs, remain separable.


OK: here is a complete solution with regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
2\tSecond Demo
\t\tAnother Section
\t\t\tAnd Another 3rd level Section
\t\t\tOn two lines
3\tNo section below
4\tOnly one level below
\t\tThis is that one level
'''

import re

result=[]
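# split the text into sections, each starting with a digit at the beginning of a line
for ms in re.finditer(r'^(\d+.*?)(?=^\d|\Z)',txt,re.S | re.M):
    section=ms.group(1)
    # number of leading tabs on each indented line; a list so it also works on Python 3
    tm=[len(t) for t in re.findall(r'^\t+', section, re.M)]
    subsections=max(tm) if tm else 0
    sec=[re.search(r'(^\d+.*)', section).group(1)]
    if subsections:
        for i in range(2,subsections+1):
            # collect every line indented with exactly i tabs
            lt=r'^{}([^\t]+)$'.format(r'\t'*i)
            level=re.findall(lt, section, re.M)
            sec.append(' '.join(s.strip() for s in level))

    print('->'.join(sec))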

Prints:

1   Demo->Example->This is the description text body that I am trying to capture with regex.
2   Second Demo->Another Section->And Another 3rd level Section On two lines
3   No section below
4   Only one level below->This is that one level

Limitations:

1) This is limited to the format you described.
2) It will not handle reverse levels properly:
    1 Section 
         Second Level
             Third Level
         Second Level Again       <== This would be jammed in with 'second level'
    How would you handle multiple levels?

3) Won't handle multiline section headers:

    3    Like
         This

Running this code on the full example:

1a  Title->Subtitle->Description Second Line of Description
1b  Title->Subtitle A Subtitle B->Description Description
2   Title->Subtitle A Subtitle B Subtitle C->Description Description Description

You can see that the second and third levels are each joined together, but I don't know how you would want that format handled.
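
For the format requested in the question's update, where the parent number and title are repeated for every subtitle, a line-by-line pass may be simpler than regex. The following is only a sketch under a few assumptions: the function name `flatten` and its `sep` parameter are made up for illustration, the number and title are separated by one tab, subtitles are indented with two tabs, description lines with three or more, and a subtitle without any description produces no output row.

```python
def flatten(text, sep='->'):
    """Return one flattened row per description (hypothetical helper)."""
    rows = []
    path = []   # current [number, title, subtitle]
    desc = []   # wrapped lines of the description being collected

    def flush():
        # Emit a row for the collected description, repeating its parents.
        if desc:
            head = ''.join(sep * depth + name for depth, name in enumerate(path))
            rows.append(head + sep * 3 + ' '.join(desc))
            del desc[:]

    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        body = line.lstrip('\t')
        depth = len(line) - len(body)     # number of leading tabs
        if depth >= 3:
            desc.append(body.strip())     # description line, possibly wrapped
            continue
        flush()                           # any heading ends the description
        if depth == 0:
            number, _, title = body.partition('\t')
            path = [number.strip(), title.strip()]
        else:
            path = path[:2] + [body.strip()]   # new subtitle under the same title
    flush()
    return rows


sample = (
    "1b\tTitle\n"
    "\t\tSubtitle A\n"
    "\t\t\tDescription\n"
    "\t\tSubtitle B\n"
    "\t\t\tDescription\n"
)
for row in flatten(sample):
    print(row)
```

Because each subtitle replaces the previous one in `path` instead of being joined to it, a "Second Level Again" line simply starts a new row rather than being jammed in with the earlier subtitle.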

Answer 1 (score: 0)

How about this?

re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)

It also captures the tabs and newlines, but those can be removed afterwards.
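
One possible way to do that clean-up, shown as a rough sketch: the sample `doc` string and the names `blocks` and `descriptions` are assumptions for illustration. Note that the pattern only matches lines that end in a newline, and that `str.split()` with no arguments also collapses repeated spaces inside the text.

```python
import re

# Sample input, assumed for illustration; tabs are written as \t escapes.
doc = (
    "1\tDemo\n"
    "\t\tExample\n"
    "\t\t\tThis is the description text body that I am\n"
    "\t\t\ttrying to capture with regex.\n"
    "\t\tSep\n"
    "\t\t\tAnd Another Section\n"
    "\t\t\tOn two lines\n"
)

# Each match is a run of consecutive lines that start with 3 tabs.
blocks = re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)

# split() with no arguments breaks on any run of whitespace (tabs, newlines,
# spaces), so joining with a single space yields one clean line per block.
descriptions = [' '.join(block.split()) for block in blocks]
print(descriptions)
```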

Answer 2 (score: 0)

Using re (Python 2):

import re
text = "yourtexthere"
lines = re.findall("\t{3}.+", text)

Without the tabs "\t":

text = "yourtexthere"
lines = [i[3:] for i in re.findall("\t{3}.+", text)]

To get the final output:

...
"\n".join(lines)



Correction:

It's not great yet, but I'm working on it:

import re
text = "..."
# keep lines indented with 2 or 3 tabs (a run of 4 spaces is treated as one tab)
out = re.findall("\t{2,3}.+", text.replace("    ", "\t"))
fixed = []
sub = []
for i in out:
    if not i.startswith("\t"*3):
        # a 2-tab line closes the current group of 3-tab lines
        if sub:
            fixed.append(tuple(sub))
            sub = []
    else:
        sub.append(i)
if sub:
    fixed.append(tuple(sub))
print(fixed)