Question

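I'm using Beautiful Soup to parse a badly designed web page.

Right now I need to select the comments section of the page. Each comment is a DIV with an ID like "IAMCOMMENT_00001", and that's all there is to go on. There are no classes (which would have helped a lot).

So I'm forced to search for every DIV whose ID starts with "IAMCOMMENT", but I can't figure out how to do that. The closest thing I could find is SoupStrainer, but I don't understand how to use it.

How can I achieve this?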

Answer 1

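I would use BeautifulSoup's built-in find_all function with a compiled regular expression. Note that the keyword is id, not id_ (only class needs the trailing underscore, because it is a reserved word in Python):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml, 'html.parser')
# Match every div whose id starts with "IAMCOMMENT_"
soup.find_all('div', id=re.compile('^IAMCOMMENT_'))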
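For reference, here is a small self-contained sketch of three ways to pick out these DIVs: a regex passed to find_all, a CSS attribute-prefix selector, and the SoupStrainer mentioned in the question. The sample markup and the variable names are made up for illustration; only the "IAMCOMMENT_" prefix comes from the question. It assumes a reasonably recent bs4 and uses the standard-library html.parser (parse_only is not supported by html5lib).

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup standing in for the real page.
html = """
<div id="header">Site header</div>
<div id="IAMCOMMENT_00001">First comment</div>
<div id="IAMCOMMENT_00002">Second comment</div>
<div id="NOT_IAMCOMMENT_1">Unrelated block</div>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Regex filter on the id attribute. The ^ anchor keeps ids that merely
#    contain the prefix (like NOT_IAMCOMMENT_1) from matching.
by_regex = soup.find_all("div", id=re.compile(r"^IAMCOMMENT_"))

# 2. CSS attribute selector: id "starts with" the prefix.
by_css = soup.select('div[id^="IAMCOMMENT_"]')

# 3. SoupStrainer: build the tree from the matching divs only.
only_comments = SoupStrainer("div", id=re.compile(r"^IAMCOMMENT_"))
strained = BeautifulSoup(html, "html.parser", parse_only=only_comments)
by_strainer = strained.find_all("div")

comment_ids = [div["id"] for div in by_regex]
css_ids = [div["id"] for div in by_css]
strainer_ids = [div["id"] for div in by_strainer]
print(comment_ids)
```

The first two approaches filter an already parsed tree; SoupStrainer filters while parsing, which may be worth it on very large pages but is otherwise optional.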

Answer 2

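If the DIVs you want sit inside HTML comments (<!-- ... -->), you first need to find the comment nodes themselves. One way to do that: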

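import re
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(myhtml, 'html.parser')
# Collect every HTML comment node; string= is the current name for the older text= argument
comments = soup.find_all(string=lambda text: isinstance(text, Comment))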

Then find the DIVs inside each comment:

for comment in comments:
    # Re-parse the comment body as HTML so its tags become searchable
    cmnt_soup = BeautifulSoup(comment, 'html.parser')
    divs = cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')})

    # do things with the divs
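Put together, the comment-based approach might look like the sketch below. The sample markup is invented, and the stricter pattern (anchored at both ends) is my own assumption about the ID format, not something the question states. Note that this only finds DIVs that are commented out in the source; DIVs in the normal markup are not touched by it.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: one visible div, one div hidden in an HTML comment,
# and one plain comment without any tags.
html = """
<div id="IAMCOMMENT_00001">Visible comment</div>
<!-- <div id="IAMCOMMENT_00002">Commented-out comment</div> -->
<!-- just a note, no markup -->
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed ID format: the prefix followed only by digits.
pattern = re.compile(r"^IAMCOMMENT_\d+$")

# Step 1: collect the HTML comment nodes.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Step 2: parse each comment body as its own document and search it.
hidden_divs = []
for comment in comments:
    inner = BeautifulSoup(str(comment), "html.parser")
    hidden_divs.extend(inner.find_all("div", id=pattern))

hidden_ids = [div["id"] for div in hidden_divs]
print(hidden_ids)
```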

How do I select only DIVs with similar IDs?
