Question

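I have this data from a web page, and I want to extract the data between two headings, from WEB TRAFFIC BLOCK LIST to EMAILS. I have been using Beautiful Soup but could not find a relevant topic. Thanks.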

<h2>WEB TRAFFIC BLOCK LIST</h2>

<p>Indicators are not a block list.&nbsp; If you feel the need to block web traffic, I suggest the following domain and URLs:</p>

<ul>
    <li>hxxp://209.141.49.93/hello.bin</li>
    <li>carder.bit</li>
    <li>gandcrab2pie73et.onion</li>
</ul>

<p>&nbsp;</p>

<h2>EMAILS</h2>

Answer 1

You can use a regular expression:

import re

# re.DOTALL lets "." match newlines, so the non-greedy group can span several lines
content = re.search(
    '<h2>WEB TRAFFIC BLOCK LIST</h2>(.*?)<h2>EMAILS</h2>',
    html,
    re.DOTALL
).group(1)
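The call above raises an AttributeError when either heading is missing, because `re.search` then returns `None`. A minimal sketch of a more defensive variant follows; the helper name `extract_between` and the trimmed sample `html` are assumptions made for illustration, not part of the original answer.

```python
import re

# Trimmed copy of the markup from the question (assumed sample input)
html = """
<h2>WEB TRAFFIC BLOCK LIST</h2>
<p>Indicators are not a block list.</p>
<ul>
    <li>hxxp://209.141.49.93/hello.bin</li>
    <li>carder.bit</li>
    <li>gandcrab2pie73et.onion</li>
</ul>
<p>&nbsp;</p>
<h2>EMAILS</h2>
"""


def extract_between(html, start_heading, end_heading):
    """Return the markup between two <h2> headings, or None if not found.

    Hypothetical helper: the heading texts are escaped so that any regex
    metacharacters in them are matched literally.
    """
    pattern = (
        r"<h2>\s*" + re.escape(start_heading) + r"\s*</h2>"
        r"(.*?)"
        r"<h2>\s*" + re.escape(end_heading) + r"\s*</h2>"
    )
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


fragment = extract_between(html, "WEB TRAFFIC BLOCK LIST", "EMAILS")

# Pull the text of each list item out of the captured fragment
items = re.findall(r"<li>(.*?)</li>", fragment or "", re.DOTALL)
print(items)
```

This only works while the headings are plain `<h2>` tags without attributes; markup that varies more than that is usually better handled by a parser.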

Or use Beautiful Soup and collect the nodes between the start and end tags:

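from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
start = soup.find('h2', string='WEB TRAFFIC BLOCK LIST')
end = soup.find('h2', string='EMAILS')
content = ''
item = start.next_sibling

# Walk the siblings after the first heading until the second heading (or the end) is reached
while item is not None and item is not end:
    content += str(item)
    item = item.next_sibling

print(content)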
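If only the list entries are needed rather than the raw markup, the same sibling walk can collect the text of each `<li>`. This is a minimal sketch, assuming the `bs4` package is installed; the function name `items_between` and the trimmed sample `html` are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed sample input)
html = """
<h2>WEB TRAFFIC BLOCK LIST</h2>
<p>Indicators are not a block list.</p>
<ul>
    <li>hxxp://209.141.49.93/hello.bin</li>
    <li>carder.bit</li>
    <li>gandcrab2pie73et.onion</li>
</ul>
<p>&nbsp;</p>
<h2>EMAILS</h2>
<ul>
    <li>not part of the block list</li>
</ul>
"""


def items_between(html, start_text, end_text):
    """Return the text of every <li> between two <h2> headings."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("h2", string=start_text)
    if start is None:
        return []

    items = []
    # find_next_siblings() with no arguments yields only tags, not bare strings
    for sibling in start.find_next_siblings():
        if sibling.name == "h2" and sibling.get_text(strip=True) == end_text:
            break  # reached the closing heading
        items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
    return items


block_list = items_between(html, "WEB TRAFFIC BLOCK LIST", "EMAILS")
print(block_list)
```

Stopping at the closing heading inside the loop means a missing EMAILS heading simply collects everything up to the end of the parent element instead of raising an error.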

Answer 2

Use PHP Html Parser; it is the best way to do this. If you use a regular expression instead, it will perform at its worst when the page contains a large amount of data.

Scraping/extracting content between two different HTML tags using Python
