I need to match everything between HTML tags, or, if there is another way, extract all of the content between the tags.
Here is a sample of the data:
<B>stuff here</B>
<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT> vehicle industries have resulted in increased competition and
have had a material adverse effect on our business, financial condition, and
operations. </B>
medallions. </P> <P STYLE="margin-top:12pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman"><B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B></P>
These are the matches I need to get from this small chunk:
<B>stuff here</B>
<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT> vehicle industries have resulted in increased competition and
have had a material adverse effect on our business, financial condition, and
operations. </B>
<B>We borrow money, which magnifies the potential for gain or loss on amounts invested, and may increase the risk of investing in us. </B>
Here are a couple of regular expressions I have tried; neither works the way I want it to:
re.compile("<[Bb]>[\!\@\#\$\%\^\&\*\(\)\_\+\-\=\,\.\/\<\?\:\"\;\'\{\}\[\]\|\\\w\d\s]*<\/[Bb]>", re.MULTILINE)
re.compile("<[Bb]>.+<\/[Bb]>", re.MULTILINE)
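Alternatively, is there a better way to do this without regular expressions?
I am currently loading the HTML content into a text file to remove the indentation.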
Answer 0 (score: 1)
You can use the following pattern to match everything between the <B> tags:
(?s)(?<=<B>).*?(?=<\/B>)
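This uses a positive lookbehind ((?<=<B>)) and a positive lookahead ((?=<\/B>)) to match anything between the tags without including the tags themselves. The (?s) flag lets . match newlines as well, so a match can span several lines.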
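As a rough illustration, here is a minimal Python sketch of the lookaround approach, run against a shortened version of the sample from the question. The names html, BOLD_WITH_TAGS, and BOLD_INNER are made up for this example. It assumes the <B> elements are not nested inside each other. It uses a non-greedy quantifier (.*?), because a greedy .* runs from the first <B> to the last </B> and yields a single match. Note that re.MULTILINE only changes how ^ and $ behave; it is re.DOTALL (the same as (?s)) that lets . cross line breaks, which is likely why the second attempt in the question misses the multi-line element.

```python
import re

# Shortened sample based on the question: three <B> elements, one spanning two lines.
html = """<B>stuff here</B>
<B>Changes in the taxicab and <FONT STYLE="white-space:nowrap">for-
hire</FONT> vehicle industries. </B>
medallions. </P> <P STYLE="margin-top:12pt"><B>We borrow money. </B></P>"""

# re.DOTALL lets "." match newlines; ".*?" stops at the first closing tag.
# re.IGNORECASE accepts both <B> and <b>.
FLAGS = re.DOTALL | re.IGNORECASE

# Matches include the surrounding tags, as the question asks for.
BOLD_WITH_TAGS = re.compile(r"<b>.*?</b>", FLAGS)

# Lookaround form: matches only the text between the tags.
BOLD_INNER = re.compile(r"(?<=<b>).*?(?=</b>)", FLAGS)

with_tags = BOLD_WITH_TAGS.findall(html)
inner = BOLD_INNER.findall(html)

# For comparison: the greedy form swallows everything up to the last </B>.
greedy = re.findall(r"(?s)(?<=<B>).*(?=</B>)", html)

for match in with_tags:
    print(repr(match))
```

If the documents may contain nested or malformed tags, an HTML parser is generally more robust than a regular expression; the sketch above only covers the flat case shown in the sample.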