Using Python requests and Beautiful Soup, how do I pick the correct HTML block when the response may contain more than one (or strip out the part I don't want)?
import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())
The first time the script runs against the target, r.text contains:
<html>
<head>
<script language="Javascript">
top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
nothing to see here
</script>
</head>
</html>
<!-- cgi_interesting -->
<html>
<head>
<meta content="stuff"/>
<link href="things"/>
</head>
<body bgcolor="#FFFFFF">
<script language="Javascript">
</script>
interesting content
</body>
</html>
The script prints (not what I expect):
<html>
<head>
<script language="Javascript">
top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
nothing to see here
</script>
</head>
</html>
<!-- cgi_interesting -->
If the script is run again, the first block is absent and the interesting content is printed; r.text then looks like this:
<!-- cgi_interesting -->
<html>
<head>
<meta content="stuff"/>
<link href="things"/>
</head>
<body bgcolor="#FFFFFF">
<script language="Javascript">
</script>
interesting content
</body>
</html>
The script prints (as expected):
<!-- cgi_interesting -->
<html>
<head>
<meta content="stuff"/>
<link href="things"/>
</head>
<body bgcolor="#FFFFFF">
<script language="Javascript">
</script>
interesting content
</body>
</html>
Both blocks are present in r.text if the target has not been queried before. It seems that BeautifulSoup only processes the first block it finds.
I want the code to work whether or not the first block is present. How can I test r.text for multiple HTML blocks, pick the right one, and pass it to BeautifulSoup?
I am currently looking into using re.sub to strip everything before <!-- cgi_interesting -->, but is there a better way?
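For reference, the same "drop everything before the comment" idea can be done without re at all, using str.partition. This is only a rough sketch of what I mean, assuming the marker always reads exactly <!-- cgi_interesting --> (my_url as above):

import requests
from bs4 import BeautifulSoup

r = requests.get(my_url + "cgi/interesting.cgi")
# partition() returns (before, separator, after); if the marker is missing,
# separator is empty and we fall back to parsing the whole response
before, marker, after = r.text.partition("<!-- cgi_interesting -->")
html = marker + after if marker else r.text
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())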
Answer 0 (score: 2)
That HTML is more invalid than BeautifulSoup can cope with. Reach out and slap whoever wrote such a buggy site! You can slice the buffer at the </html> boundaries and run the soup over each piece:
import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
content = bytearray(r.content)  # bytearray so the slices below can be deleted in place
html_blocks = []
decl = b''
# save any declarations preceding the first block so they can be prepended to every block
html_index = content.find(b'<html>')
if html_index >= 0:
    decl = bytes(content[:html_index])
    del content[:html_index]
# find html extents
while content:
    # find end tag
    extent = content.find(b'</html>')
    if extent >= 0:
        extent += len(b'</html>')
    else:
        # no end tag, hope BS figures it out
        extent = len(content)
    # put in list and delete from input
    html_blocks.append(decl + bytes(content[:extent]))
    del content[:extent]
    # advance to next html tag
    html_index = content.find(b'<html>')
    if html_index == -1:
        html_index = len(content)
    del content[:html_index]

for block in html_blocks:
    soup = BeautifulSoup(block, "lxml")
    print(soup.prettify())
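If you only want the interesting block rather than all of them, you could also filter html_blocks after collecting them. A minimal sketch, assuming (as in the responses shown) that the unwanted frame-redirect block has no body text while the block you want does:

chosen = None
for block in html_blocks:
    candidate = BeautifulSoup(block, "lxml")
    # the frame-redirect block has no body text, the real page does
    if candidate.body is not None and candidate.body.get_text(strip=True):
        chosen = candidate
        break
if chosen is not None:
    print(chosen.prettify())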
Answer 1 (score: 0)
Rather than keeping every HTML block, I used re.sub to strip everything before the HTML comment, since it isn't needed. This has now looped over 60+ sites successfully.
import re
import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
# strip everything before the marker comment, keeping the comment itself
result = re.sub(r"(?s).*?(<!-- cgi_interesting -->)", r"\1", r.text, count=1)
soup = BeautifulSoup(result, "lxml")
#soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())