Regex pattern for HTML tags

Time: 2018-07-30 05:28:06

Tags: python regex python-3.x

I have been trying to find a pattern that extracts the text between > and < from the HTML below:

<li><a href="/web/20151030182314/https://www.wiki.edu/trees/">Forest Trees Green</a></li>

<span class="field-content">Tress, Design &amp; Plants</span></div> 

<h3><a href="http://web.archive.org/web/20151030182501/http://www.latimes.com">Trees</>
<div class="tf-text">
        Trees provide oxygen <a
<h4>Trees</h4>
<span class="field-content">Trees everywhere</span>  </div></li>
  </ul></div>    </div>
<h3 class="secondary-feature-headline">Through European Security Initiative, Stanford focuses on changing trees</h3>

Does anyone have any suggestions? P.S.: I can't use BeautifulSoup.

1 answer:

Answer 0: (score: 0)

You can use BeautifulSoup to extract the result, or you can use the standard re module to extract the text:

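import re

# text_content is assumed to hold the HTML from the question as a str
data = re.findall(r'>.*?<', text_content)
for string in data:
    # drop the delimiting angle brackets and surrounding whitespace
    sub = string.replace('>', '').replace('<', '').strip()
    if sub:  # skip empty matches such as "><"
        print(sub)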

For the text above, the output is:

Forest Trees Green
Tress, Design & Plants
Trees
Trees
Trees everywhere
Through European Security Initiative, Stanford focuses on changing trees
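
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It is a variation, not the snippet above: it uses a capturing group with `[^<>]*` instead of `.*?`, so it also picks up text that starts on the line after a tag (for example "Trees provide oxygen"), and it decodes entities such as `&amp;` with `html.unescape`. The names `SAMPLE`, `TEXT_BETWEEN_TAGS` and `extract_text` are made up for illustration, and `SAMPLE` is only a shortened excerpt of the HTML in the question.

```python
import html
import re

# Shortened excerpt of the HTML from the question (assumption: the real
# input is available as one str; any HTML string can be used instead).
SAMPLE = '''<li><a href="/web/20151030182314/https://www.wiki.edu/trees/">Forest Trees Green</a></li>
<span class="field-content">Tress, Design &amp; Plants</span></div>
<h4>Trees</h4>
<div class="tf-text">
        Trees provide oxygen <a
'''

# Capture only what sits between ">" and the next "<".
# [^<>] also matches newlines, unlike "." without re.DOTALL.
TEXT_BETWEEN_TAGS = re.compile(r'>([^<>]*)<')


def extract_text(markup):
    """Return the non-empty text runs found between tags (hypothetical helper)."""
    results = []
    for raw in TEXT_BETWEEN_TAGS.findall(markup):
        # Decode entities (&amp; -> &) and trim surrounding whitespace.
        cleaned = html.unescape(raw).strip()
        if cleaned:
            results.append(cleaned)
    return results


for piece in extract_text(SAMPLE):
    print(piece)
```

Keep in mind that a regex like this only looks for `>` and `<` characters; it knows nothing about comments, `<script>` blocks, or attribute values that contain angle brackets, so it is best suited to small, well-behaved fragments like the one in the question.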