BeautifulSoup: Remove empty div tags

Time: 2018-07-26 14:58:50

Tags: regex python-3.x beautifulsoup python-requests

I am extracting data from a website, and this is how I extract the data from its divs.

import re
import requests
from bs4 import BeautifulSoup as bs

page = requests.get("https://en.wikipedia.org/wiki/Web_mining")
soup = bs(page.text, 'html.parser')

flags = re.DOTALL
ptag = re.compile(r'<[^>]*?>', flags)
pdiv = re.compile('<div [^>]*?>(.*?)</div>', flags)

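def remove(soup, tagname):
    # Remove each <tagname> element, re-attaching its children to the end of its parent.
    for tag in soup.find_all(tagname):
        # Copy the child list: append() moves each child and mutates tag.contents.
        contents = list(tag.contents)
        parent = tag.parent
        tag.extract()
        for child in contents:
            parent.append(child)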

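def main():
    # The div bodies are captured from the serialized HTML here, so the
    # remove() call on the next line does not affect them.
    divs = pdiv.findall(soup.prettify())
    remove(soup, "script")
    for i, d in enumerate(divs):
        # Strip the tags, then keep only the non-empty text fragments.
        parts = [s.strip() for s in ptag.split(d)]
        text = '\n'.join(s for s in parts if s)
        print("%d:\n%s\n" % (i, text))

main()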

Output:

0:


1:


2:


3:


4:
From Wikipedia, the free encyclopedia

5:


6:


7:

How can I remove the empty divs?
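
One possible approach that stays within the regex-based code above is to skip any div whose text is empty once its tags are stripped. This is a minimal sketch and not part of the original post: the function name `div_texts` and the `sample` HTML are hypothetical, introduced only for illustration, while `ptag` and `pdiv` are the same two patterns used in the question.

```python
import re

FLAGS = re.DOTALL
ptag = re.compile(r'<[^>]*?>', FLAGS)                  # matches any tag
pdiv = re.compile(r'<div [^>]*?>(.*?)</div>', FLAGS)   # div with attributes, non-greedy body

def div_texts(html):
    """Return the text of each matched div, skipping divs that contain no text."""
    texts = []
    for body in pdiv.findall(html):
        parts = [s.strip() for s in ptag.split(body)]
        text = '\n'.join(s for s in parts if s)
        if text:  # an empty string means the div held only tags or whitespace
            texts.append(text)
    return texts

# Hypothetical sample input: two empty divs and one with text.
sample = (
    '<div id="a"></div>'
    '<div class="b">  <span></span> </div>'
    '<div id="c">From Wikipedia, the free encyclopedia</div>'
)
for i, text in enumerate(div_texts(sample)):
    print("%d:\n%s\n" % (i, text))
```

Note that the non-greedy pattern stops at the first `</div>` it finds, so nested divs are not matched as whole elements; the filter only hides empty matches and does not change that limitation.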

Update: Added the URL of the web page and updated the output to match it.
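
For reference, the empty divs can also be removed from the parsed tree itself instead of being filtered out of regex matches. The sketch below is not from the original post and rests on an assumption about what "empty" means: a div counts as empty when it has no text after stripping whitespace and contains no elements other than divs. The function name `remove_empty_divs` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

def remove_empty_divs(soup):
    """Delete every div that has no text and no non-div element inside it."""
    # Walk in reverse document order so inner divs are handled before their parents.
    for div in reversed(soup.find_all("div")):
        has_text = bool(div.get_text(strip=True))
        # Assumption: elements such as <img> make a div non-empty even without text.
        has_other_elements = div.find(lambda t: t.name != "div") is not None
        if not has_text and not has_other_elements:
            div.decompose()
    return soup

# Hypothetical sample input.
html = (
    '<div id="a"></div>'
    '<div id="b"><div> </div></div>'
    '<div id="c"><p>From Wikipedia, the free encyclopedia</p></div>'
    '<div id="d"><img src="x.png"></div>'
)
soup = BeautifulSoup(html, "html.parser")
remove_empty_divs(soup)
print([d.get("id") for d in soup.find_all("div")])
```

Processing the divs innermost-first means a div that only wraps other empty divs is itself seen as empty by the time it is checked.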

0 answers:

No answers