Question

I have been trying to find a pattern that extracts the text between > and < from the HTML below:

<li><a href="/web/20151030182314/https://www.wiki.edu/trees/">Forest Trees Green</a></li>

<span class="field-content">Tress, Design &amp; Plants</span></div> 

<h3><a href="http://web.archive.org/web/20151030182501/http://www.latimes.com">Trees</>
<div class="tf-text">
        Trees provide oxygen <a
<h4>Trees</h4>
<span class="field-content">Trees everywhere</span>  </div></li>
  </ul></div>    </div>
<h3 class="secondary-feature-headline">Through European Security Initiative, Stanford focuses on changing trees</h3>

Does anyone have any suggestions? P.S. I cannot use BeautifulSoup.

Answer 1

You could extract the results with BeautifulSoup, but Python's standard regular expression module can also pull out the text:

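import html
import re

# text_content is assumed to hold the HTML from the question as one string.
# Non-greedy match of everything between a '>' and the next '<' on the same line.
data = re.findall(r'>.*?<', text_content)
for string in data:
    # Drop the delimiters and whitespace, then decode entities such as &amp;.
    sub = html.unescape(string.replace('>', '').replace('<', '').strip())
    if sub:
        print(sub)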

For the text above, the output is:

Forest Trees Green
Tress, Design & Plants
Trees
Trees
Trees everywhere
Through European Security Initiative, Stanford focuses on changing trees
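
As a self-contained variation on the same idea, the sketch below wraps the pattern in a function and uses a capture group, so the delimiters never need to be stripped afterwards. The names extract_text and SAMPLE are made up for illustration, and SAMPLE holds only two lines of the question's HTML. One difference to be aware of: the character class [^<>] also matches newlines, so text that starts on the line after a tag (such as "Trees provide oxygen" in the question) would be picked up here, whereas the '.'-based pattern above skips it. Like any regex over HTML, this is a rough heuristic rather than a real parser.

```python
import html
import re

# Hypothetical sample: two lines taken from the question's HTML.
SAMPLE = '''<li><a href="/web/20151030182314/https://www.wiki.edu/trees/">Forest Trees Green</a></li>
<span class="field-content">Tress, Design &amp; Plants</span></div>'''


def extract_text(markup):
    """Return the non-empty text runs found between '>' and '<'."""
    results = []
    # The group captures only the characters between the delimiters.
    # [^<>] excludes both brackets and, unlike '.', matches newlines too.
    for chunk in re.findall(r'>([^<>]*)<', markup):
        # Decode entities such as &amp; and trim surrounding whitespace.
        text = html.unescape(chunk).strip()
        if text:
            results.append(text)
    return results


print(extract_text(SAMPLE))
```

Returning a list instead of printing makes it easier to filter or deduplicate the results later.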
