I am trying to clean 18,000 documents to train a word2vec classifier. A sample document looks like this:
From: shou@logos.asd.sgi.com (Tom Shou)
Subject: Ford Explorer 4WD - do I need performance axle?
We're considering getting a Ford Explorer XLT with 4WD and we have the
following questions (All we would do is go skiing -- no off-roading):
1. With 4WD, do we need the "performance axle" - (limited slip axle).
Its purpose is to allow the tires to act independently when the tires
are on different terrain.
2. Do we need the all-terrain tires (P235/75X15) or will the
all-season (P225/70X15) be good enough for us at Lake Tahoe?
Thanks,
Tom
-- *(there maybe --- also)*
===========================================================================*(only one of the two boundaries made with ===== may be present(like sometimes only the top boundary is present) in some cases and may differ in length)*
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
===========================================================================
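So I need to remove the footer. I can already remove the From and Subject lines (with a regex applied to the first two lines of the document), but I cannot remove this part: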
--
===========================================================================
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
===========================================================================
Some footers contain only two or three dashes, like:
--
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
or
---
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
Sometimes the box is drawn with _ or + instead of ====, for example:
--(this maybe --- or may not exist at all but then the below ______ lines will be there)
________________________________________________________________________(this can be + also)
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
_________________________________________________________________________
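I'm not very good with regular expressions. I tried a flawed regex, for example ((_|-|=|\+){2,})(.|\n)*, to remove the footer, but I had not considered that the body itself can contain dashes, so it also removed content I wanted to keep.
For example, one line of the body (line 4) contains two dashes: (All we would do is go skiing -- no off-roading):. The regex removed those dashes and everything after them, which I don't want. I only want to remove the footer.
So what should my regex look like, or what approach should I use to strip the footer, given that the -- or --- line is not always present (in which case there is a box made of ______, ++++++ or ======), and vice versa?
Please help. Thanks in advance.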
Answer 0 (score: 2)
You can use
re.sub(r'(?ms)^[ \t]*([-_=+])\1+.*', '', text)
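See the regex demo.
Details
(?ms) - enables re.M (^ matches at the start of each line) and re.DOTALL (. matches any character, including newlines)
^ - start of a line
[ \t]* - zero or more horizontal whitespace characters (you could also use [^\S\r\n]*)
([-_=+]) - Group 1: a -, _, = or +
\1+ - the same character as captured in Group 1, one or more times
.* - the rest of the string.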
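As a minimal sketch of how the pattern above behaves, the following compares it with the unanchored pattern from the question on a shortened version of the sample document. The helper name strip_footer, the constant names, and the trimmed sample are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer: a line that starts (after optional spaces/tabs)
# with two or more of the same character among - _ = +, plus everything after it.
FOOTER_RE = re.compile(r'(?ms)^[ \t]*([-_=+])\1+.*')

# Pattern from the question: not anchored to the start of a line.
NAIVE_RE = re.compile(r'((_|-|=|\+){2,})(.|\n)*')


def strip_footer(text):
    """Hypothetical helper: drop the footer and any trailing whitespace left behind."""
    return FOOTER_RE.sub('', text).rstrip()


# Shortened sample, assumed to have the From/Subject lines already removed.
sample = (
    "We're considering getting a Ford Explorer XLT with 4WD and we have the\n"
    "following questions (All we would do is go skiing -- no off-roading):\n"
    "Thanks,\n"
    "Tom\n"
    "--\n"
    "===========================================================================\n"
    "Tom Shou Silicon Graphics\n"
    "shou@asd.sgi.com 2011 N. Shoreline Blvd.\n"
    "===========================================================================\n"
)

cleaned = strip_footer(sample)       # body kept, footer removed
naive = NAIVE_RE.sub('', sample)     # cut at the "--" inside the body

print(cleaned)
print(repr(naive))
```

Because of the ^ anchor, the -- inside the body is left alone, while the unanchored pattern cuts the text at that point. The trade-off is that any body line that itself begins with two or more of the same -, _, = or + character would also be treated as the start of the footer.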