Match everything between HTML tags

Time: 2018-01-29 18:51:35

Tags: python regex python-2.7 web-scraping multiline

I need to match everything between HTML tags or, if there is another way, extract all the information between the tags.

Here is a sample of the data:

<B>stuff here</B>

<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT>  vehicle industries have resulted in increased competition and  
have had a material adverse effect on our business, financial condition, and 
operations.  </B>


medallions. </P> <P STYLE="margin-top:12pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman"><B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B></P>

These are the matches I need to get from this small chunk:

<B>stuff here</B>

<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT>  vehicle industries have resulted in increased competition and  
have had a material adverse effect on our business, financial condition, and 
operations.  </B>

<B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B>

Here are a couple of regular expressions I tried; neither worked the way I hoped:

re.compile("<[Bb]>[\!\@\#\$\%\^\&\*\(\)\_\+\-\=\,\.\/\<\?\:\"\;\'\{\}\[\]\|\\\w\d\s]*<\/[Bb]>", re.MULTILINE)
re.compile("<[Bb]>.+<\/[Bb]>", re.MULTILINE)

Alternatively, is there a better way to do this without regular expressions?

I am currently loading the HTML content into a text file to remove the indentation.

1 Answer:

Answer 0 (score: 1)

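You can match everything between the <B> tags with the following pattern:

 (?s)(?<=<B>).*?(?=<\/B>)

This uses a positive lookbehind ((?<=<B>)) and a positive lookahead ((?=<\/B>)) to match whatever lies between the tags without including the tags themselves. The (?s) flag lets . match newlines, and the lazy .*? stops at the first closing </B>; a greedy .* would instead run from the first <B> to the last </B> in the input.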
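
As a rough sketch of how this might look in Python, the snippet below runs two variants against a shortened copy of the sample from the question. The names html, bold_re, matches, and inner are illustrative only, and the P tag's STYLE attribute is abbreviated. The first variant keeps the tags in the result, which seems to be what the question asks for; the second uses lookarounds to return only the inner text. Note that re.MULTILINE only changes how ^ and $ behave; it is re.DOTALL that lets . cross line breaks, which is probably why the second attempt in the question fails on the multi-line block.

```python
import re

# Shortened copy of the sample data from the question (illustrative).
html = """<B>stuff here</B>

<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT>  vehicle industries have resulted in increased competition and
have had a material adverse effect on our business, financial condition, and
operations.  </B>

medallions. </P> <P STYLE="margin-top:12pt"><B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B></P>
"""

# Variant 1: keep the tags in each match.
# re.DOTALL lets "." match newlines; re.IGNORECASE also accepts <b>...</b>.
# The lazy ".*?" stops at the first closing tag rather than the last one.
bold_re = re.compile(r"<b>.*?</b>", re.DOTALL | re.IGNORECASE)
matches = bold_re.findall(html)

# Variant 2: lookbehind/lookahead, returning only the text between the tags.
inner = re.findall(r"(?<=<B>).*?(?=</B>)", html, flags=re.DOTALL)

for m in matches:
    print(repr(m))
```

Regular expressions are brittle on nested or malformed markup (for example a <B> inside another <B>), so for anything beyond simple, flat cases like this sample an HTML parser is likely the safer choice.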