How to select the right HTML block in a python-requests response

Date: 2018-04-12 03:37:57

Tags: html beautifulsoup python-requests

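Using python-requests and Beautiful Soup, how do I select the right HTML block when the response may contain several blocks (or strip out the content I don't want)?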

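import requests
from bs4 import BeautifulSoup

# my_url is the base URL of the target, defined elsewhere
url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
print(soup.prettify())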

The first time the script is run against the target, r.text contains:

 <html>
 <head>
  <script language="Javascript">
   top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
   nothing to see here
  </script>
 </head>
</html>
<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

The script returns (not what I expect):

 <html>
 <head>
  <script language="Javascript">
   top.topFrame.document.location.href="../cgi/navigation_frame.cgi";
   nothing to see here
  </script>
 </head>
</html>
<!-- cgi_interesting -->

On subsequent runs of the script, the first block is absent and the interesting content is output; r.text looks like this:

<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

The script returns (as expected):

<!-- cgi_interesting -->
<html>
 <head>
  <meta content="stuff"/>
  <link href="things"/>
 </head>
 <body bgcolor="#FFFFFF">
  <script language="Javascript">
  </script>
interesting content
</body>  
</html>

If the target has not been queried before, both blocks are present in r.text. It seems BeautifulSoup only processes the first block it finds.

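I want the code to work whether or not the first block is present. How can I test r.text for multiple HTML blocks, select the appropriate one, and pass it to BeautifulSoup?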

I am currently looking into using re.sub to remove everything before <!-- cgi_interesting -->, but is there a better way?

2 answers:

Answer 0 (score: 2)

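That HTML is more invalid than BeautifulSoup can handle. A slap on the wrist for whoever wrote such a buggy website! You can slice the buffer at the </html> boundaries and run the soup once per piece: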

import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
# bytes is immutable, so copy into a bytearray to allow in-place deletion
content = bytearray(r.content)

html_blocks = []

# save declarations for all blocks
html_index = content.find(b'<html>')
if html_index >= 0:
    decl = content[:html_index]
    del content[:html_index]

    # find html extents
    while content:

        # find end tag
        extent = content.find(b'</html>')
        if extent >= 0:
            extent += len(b'</html>')
        else:
            # no end tag, hope BS figures it out
            extent = len(content)

        # put in list and delete from input
        html_blocks.append(bytes(decl + content[:extent]))
        del content[:extent]

        # advance to next html tag
        html_index = content.find(b'<html>')
        if html_index == -1:
            html_index = len(content)
        del content[:html_index]


for block in html_blocks:
    soup = BeautifulSoup(block, "lxml")
    print(soup.prettify())
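
A more compact variant of the same slicing idea might use a regular expression to pull out each `<html>...</html>` block. This is only a sketch: the helper name `split_html_blocks` and the sample text are made up for illustration, and it assumes every block is delimited by literal `<html>` and `</html>` tags (attributes and different letter case are tolerated).

```python
import re

# Hypothetical helper, not part of the original answer.
# Non-greedy match so each <html>...</html> pair becomes its own block.
_HTML_BLOCK = re.compile(r"<html\b[^>]*>.*?</html\s*>", re.IGNORECASE | re.DOTALL)

def split_html_blocks(text):
    """Return the <html>...</html> blocks found in text, in document order."""
    blocks = _HTML_BLOCK.findall(text)
    # If no complete block was found, fall back to the whole text
    return blocks if blocks else [text]

# Made-up sample shaped like the first response in the question
sample = (
    "<html><head><script>redirect</script></head></html>\n"
    "<!-- cgi_interesting -->\n"
    "<html><body>interesting content</body></html>\n"
)
blocks = split_html_blocks(sample)
```

Each element of `blocks` can then be passed to `BeautifulSoup(block, "lxml")`. For the response described in the question, the last block is the one that holds the interesting content.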

Answer 1 (score: 0)

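Rather than keeping every HTML block, I used re.sub to remove everything before the HTML comment, since that content is not needed. This has run successfully in a loop over more than 60 sites.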

import re
import requests
from bs4 import BeautifulSoup

url = my_url + "cgi/interesting.cgi"
r = requests.get(url)
# drop everything before the first marker comment; DOTALL lets "." span newlines
result = re.sub(r".*?(<!-- cgi_interesting -->)", r"\1", r.text, count=1, flags=re.DOTALL)
soup = BeautifulSoup(result, "lxml")
print(soup.prettify())
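
If a regular expression feels heavier than needed, the same trimming can probably be done with plain string methods. The sketch below assumes the marker comment appears verbatim in the response; the function name `keep_from_marker` is hypothetical.

```python
MARKER = "<!-- cgi_interesting -->"

def keep_from_marker(text, marker=MARKER):
    """Drop everything before the first marker; return text unchanged if it is absent."""
    before, found, after = text.partition(marker)
    if not found:
        # Marker missing: leave the response as it is
        return text
    return found + after

# Made-up sample: a leading junk block followed by the wanted one
trimmed = keep_from_marker(
    "<html>junk</html>\n<!-- cgi_interesting -->\n<html>wanted</html>"
)
```

The result can be passed to `BeautifulSoup(trimmed, "lxml")` in the same way as `result` in the answer's code.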