How to remove similar footers from many documents using regular expressions in Python

Asked: 2018-11-05 10:38:30

Tags: python regex

I'm trying to clean 18,000 documents to train a word2vec classifier. A sample document looks like this:

From: shou@logos.asd.sgi.com (Tom Shou)
Subject: Ford Explorer 4WD - do I need performance axle?

We're considering getting a Ford Explorer XLT with 4WD and we have the
following questions (All we would do is go skiing -- no off-roading):

1. With 4WD, do we need the "performance axle" - (limited slip axle).
Its purpose is to allow the tires to act independently when the tires
are on different terrain. 

2. Do we need the all-terrain tires (P235/75X15) or will the
all-season (P225/70X15) be good enough for us at Lake Tahoe?


Thanks,


Tom

-- *(this may also be ---)*


===========================================================================*(in some cases only one of the two ===== boundaries is present, e.g. only the top one, and their lengths may differ)*

        Tom Shou            Silicon Graphics
    shou@asd.sgi.com        2011 N. Shoreline Blvd. 
    415-390-5362            MS 8U-815 
    415-962-0494 (fax)      Mountain View, CA 94043

===========================================================================

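So I need to remove the footer. I can already remove the From and Subject lines (the first two lines of the document) with a regex, but I can't remove this part: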

-- 


    ===========================================================================

            Tom Shou            Silicon Graphics
        shou@asd.sgi.com        2011 N. Shoreline Blvd. 
        415-390-5362            MS 8U-815 
        415-962-0494 (fax)      Mountain View, CA 94043

    ===========================================================================

Some footers are introduced by only two or three dashes, like:

-- 


            Tom Shou            Silicon Graphics
        shou@asd.sgi.com        2011 N. Shoreline Blvd. 
        415-390-5362            MS 8U-815 
        415-962-0494 (fax)      Mountain View, CA 94043

  --- 


                Tom Shou            Silicon Graphics
            shou@asd.sgi.com        2011 N. Shoreline Blvd. 
            415-390-5362            MS 8U-815 
            415-962-0494 (fax)      Mountain View, CA 94043

Or sometimes the box is drawn with _ or + instead of ====, for example:

   --(this may be ---, or may not exist at all, but then the ______ lines below will be there)


 ________________________________________________________________________(this can be + also)

                Tom Shou            Silicon Graphics
            shou@asd.sgi.com        2011 N. Shoreline Blvd. 
            415-390-5362            MS 8U-815 
            415-962-0494 (fax)      Mountain View, CA 94043

 _________________________________________________________________________

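I'm not very good with regular expressions. I tried removing the footer with a flawed regex such as ((_|-|=|\+){2,})(.|\n)*, but I hadn't considered that - also occurs inside the content, so it deleted text I wanted to keep. For example, one line of the content (line 4) contains two dashes: "All we would do is go skiing -- no off-roading):". The regex removed those dashes and everything after them, which I don't want. I only want to remove the footer.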
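The over-deletion described above can be reproduced on a shortened sample. The `text` variable and its contents are made up for illustration; only the pattern comes from the question. Because the pattern is not anchored to the start of a line, it matches at the first `--` it finds, even mid-sentence, and `(.|\n)*` then consumes the rest of the string.

```python
import re

# Shortened, made-up sample: a body line containing " -- " followed by a footer.
text = (
    "All we would do is go skiing -- no off-roading.\n"
    "\n"
    "Thanks,\n"
    "Tom\n"
    "\n"
    "-- \n"
    "    Tom Shou    Silicon Graphics\n"
)

# The pattern from the question, unchanged: not anchored to a line start.
unanchored = r'((_|-|=|\+){2,})(.|\n)*'

# Everything from the first "--" onward is removed, including body text.
cleaned = re.sub(unanchored, '', text)
print(repr(cleaned))
```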

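So I would like to know what my regex should look like, or what approach I should use to clean the footer, given that the -- or --- line is not always present (in which case there is a box made of ______, ++++++ or =========), and vice versa.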

Please help. Thanks in advance.

1 Answer:

Answer 0: (score: 2)

You can use

re.sub(r'(?ms)^[ \t]*([-_=+])\1+.*', '', text)

See the regex demo.

Details

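  • (?ms) - re.M (^ matches at the start of each line) and re.DOTALL (. matches any character, including newlines)
  • ^ - the start of a line
  • [ \t]* - zero or more horizontal whitespace characters (you could also use [^\S\r\n]*)
  • ([-_=+]) - Group 1: a -, _, = or +
  • \1+ - the same character as captured in Group 1, one or more times
  • .* - the rest of the string.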
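
As a rough check, the pattern above can be run against shortened versions of the footer variants from the question. This is only a sketch: the helper name `strip_footer`, the sample strings, and the trailing `rstrip()` are assumptions added for illustration and are not part of the answer.

```python
import re

# The pattern from the answer: a line that starts (after optional spaces/tabs)
# with two or more identical characters from - _ = +, plus everything after it.
FOOTER_RE = re.compile(r'(?ms)^[ \t]*([-_=+])\1+.*')


def strip_footer(text):
    """Hypothetical helper: drop the footer, then trim trailing whitespace."""
    return FOOTER_RE.sub('', text).rstrip()


# Made-up body; note the " -- " in the middle of the first line.
body = (
    "All we would do is go skiing -- no off-roading.\n"
    "\n"
    "Thanks,\n"
    "Tom\n"
)

# Shortened versions of the footer variants described in the question.
footers = [
    "-- \n\n    Tom Shou    Silicon Graphics\n",
    "  --- \n\n    Tom Shou    Silicon Graphics\n",
    "====================\n\n    Tom Shou\n\n====================\n",
    " ____________________\n\n    Tom Shou\n\n ____________________\n",
    "++++++++++\n    Tom Shou\n",
]

for footer in footers:
    cleaned = strip_footer(body + "\n" + footer)
    print(repr(cleaned))
```

The mid-line `--` survives because the match must begin at the start of a line. The same anchoring means that a body line which itself begins with two or more identical `-`, `_`, `=` or `+` characters (an underlined heading, for instance) would be treated as the start of the footer, so it is worth spot-checking the output on a sample of the real documents.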