I have been trying to find a pattern that extracts the text between > and < in the HTML below:
<li><a href="/web/20151030182314/https://www.wiki.edu/trees/">Forest Trees Green</a></li>
<span class="field-content">Tress, Design & Plants</span></div>
<h3><a href="http://web.archive.org/web/20151030182501/http://www.latimes.com">Trees</>
<div class="tf-text">
Trees provide oxygen <a
<h4>Trees</h4>
<span class="field-content">Trees everywhere</span> </div></li>
</ul></div> </div>
<h3 class="secondary-feature-headline">Through European Security Initiative, Stanford focuses on changing trees</h3>
Does anyone have any suggestions? P.S. I cannot use BeautifulSoup.
Answer 0 (score: 0)
You could extract the results with BeautifulSoup, or you can use the standard regular-expression module (re) to pull out the text:
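    import re

    # text_content holds the HTML from the question as a single string.
    # '.' does not match newlines by default, so each match stays on one line.
    data = re.findall(r'>.*?<', text_content)
    for string in data:
        # Drop the surrounding '>' and '<' and trim whitespace.
        sub = string.replace('>', '').replace('<', '').strip()
        if sub:  # skip empty matches such as '><'
            print(sub)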
The output for the text above is:
    Forest Trees Green
    Tress, Design & Plants
    Trees
    Trees
    Trees everywhere
    Through European Security Initiative, Stanford focuses on changing trees
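
Note that the pattern above skips "Trees provide oxygen", because that text starts on the line after its enclosing ">". A possible variation, offered only as a sketch and not as part of the original answer, is to capture the text with a group and a negated character class, which can span newlines. The names extract_text and sample are made up for illustration, and sample is a shortened version of the HTML in the question:

```python
import re


def extract_text(html):
    """Return the non-empty text fragments found between '>' and '<'."""
    # [^<>]+ cannot cross a tag boundary, but it does match newlines,
    # so text that starts on the line after a '>' is still captured.
    fragments = re.findall(r'>([^<>]+)<', html)
    # Trim whitespace and drop fragments that were only whitespace.
    return [f.strip() for f in fragments if f.strip()]


# Shortened, hypothetical sample based on the HTML in the question.
sample = (
    '<li><a href="/trees/">Forest Trees Green</a></li>\n'
    '<div class="tf-text">\n'
    'Trees provide oxygen <a\n'
    '<h4>Trees</h4>\n'
)

print(extract_text(sample))
# ['Forest Trees Green', 'Trees provide oxygen', 'Trees']
```

Like any regex over HTML, this remains fragile: a ">" inside an attribute value or script content would confuse it, so it is best kept for simple, known markup.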