I am extracting data from a website, and this is how I extract the data from its divs.
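import re

import requests
from bs4 import BeautifulSoup as bs

page = requests.get("https://en.wikipedia.org/wiki/Web_mining")
soup = bs(page.text, 'html.parser')
flags = re.DOTALL
ptag = re.compile(r'<[^>]*?>', flags)  # matches any HTML tag
pdiv = re.compile(r'<div [^>]*?>(.*?)</div>', flags)  # captures a div's contents

def remove(soup, tagname):
    # Drop each matching tag but keep its children in the parent
    for tag in soup.findAll(tagname):
        # Copy the list: appending a child elsewhere mutates tag.contents
        contents = list(tag.contents)
        parent = tag.parent
        tag.extract()
        for child in contents:
            parent.append(child)

def main():
    divs = pdiv.findall(soup.prettify())
    remove(soup, "script")
    for i, d in enumerate(divs):
        # Split on tags and keep only the non-blank text fragments
        parts = [s.strip() for s in ptag.split(d)]
        text = '\n'.join(s for s in parts if s)
        print("%d:\n%s\n" % (i, text))

if __name__ == "__main__":
    main()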
Output:
0:
1:
2:
3:
4:
From Wikipedia, the free encyclopedia
5:
6:
7:
How can I remove the empty divs?
Update: added the URL of the web page and updated the output to match it.
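One possible way to drop the empty entries, as a minimal sketch rather than a definitive fix: reuse the same two patterns and skip any div whose text is empty after the tags are stripped. The function name non_empty_div_texts and the sample markup are hypothetical and only stand in for the result of soup.prettify(); the regex approach itself is unchanged, so it still does not handle nested divs properly.

```python
import re

FLAGS = re.DOTALL
PTAG = re.compile(r'<[^>]*?>', FLAGS)                 # any HTML tag
PDIV = re.compile(r'<div [^>]*?>(.*?)</div>', FLAGS)  # contents of a div


def non_empty_div_texts(html):
    """Return the text of each matched div, skipping divs with no text."""
    texts = []
    for d in PDIV.findall(html):
        parts = [s.strip() for s in PTAG.split(d)]
        text = '\n'.join(s for s in parts if s)
        if text:  # empty string: the div held only tags or whitespace
            texts.append(text)
    return texts


# Hypothetical sample markup, standing in for soup.prettify()
sample = (
    '<div id="a"> </div>'
    '<div id="b"><span>From Wikipedia, the free encyclopedia</span></div>'
    '<div id="c"><br/></div>'
)

for i, text in enumerate(non_empty_div_texts(sample)):
    print("%d:\n%s\n" % (i, text))
```

Because the filtering happens before enumerate, the printed indices count only the divs that contain text, so they no longer match the positions of the divs in the page.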