Question

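I am using requests and BeautifulSoup to retrieve and extract information from a web page.

However, when I call requests.get(url) and print the resulting text, it does not match what I see when I use "Inspect Element" on the page. Several parts of the HTML are missing, and some tags contain only "Loading" in a span, and so on.

I suspect this means that requests.get() is pulling the data from the page before it has fully loaded.

Is there a way to prevent this?

Thanks.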

Answer 1

As mentioned in the comments, what you see in the browser when inspecting the page is HTML that may have been generated or modified by JavaScript after the page loaded.

Your code:

import requests

r = requests.get(url)  # url is the address of the page being scraped
print(r.text)

gives you the raw response from the server. requests does not execute JavaScript, so the dynamically created HTML never appears in that response.
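A quick way to confirm this is to scan the raw response for placeholder text. The sketch below uses only the standard library; the names PlaceholderFinder and find_placeholders, the sample HTML, and the "Loading" marker (taken from the question) are illustrative assumptions, not part of requests or BeautifulSoup.

```python
from html.parser import HTMLParser


class PlaceholderFinder(HTMLParser):
    """Collect the text of <span> elements that contain a placeholder marker."""

    def __init__(self, marker="Loading"):
        super().__init__()
        self.marker = marker
        self._span_depth = 0      # how many <span> elements are currently open
        self._buffer = []         # text seen inside the outermost open <span>
        self.placeholders = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth:
            self._span_depth -= 1
            if self._span_depth == 0:
                text = "".join(self._buffer).strip()
                if self.marker in text:
                    self.placeholders.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._span_depth:
            self._buffer.append(data)


def find_placeholders(html, marker="Loading"):
    """Return the text of every span in html that still holds the marker."""
    finder = PlaceholderFinder(marker)
    finder.feed(html)
    finder.close()
    return finder.placeholders


# Hypothetical raw response: one span is still a placeholder, one is real content.
raw_html = '<div><span class="price">Loading</span><span>Title</span></div>'
print(find_placeholders(raw_html))
```

If the list is non-empty for r.text, the missing content is most likely filled in later by JavaScript rather than sent by the server.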

As also suggested in the comments, if your program needs the rendered page, use a tool that executes JavaScript, such as Selenium, PhantomJS, Qt4, or Ghost.py.

Selenium: https://pypi.python.org/pypi/selenium

PhantomJS: https://github.com/elias-winberg/phantomjs-python

Ghost: http://jeanphix.me/Ghost.py/

Scraping with Qt4: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
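As a rough illustration of the Selenium route, the sketch below loads the page in a browser and polls the HTML until the placeholder text is gone. The helper names wait_for_rendered_html and fetch_with_chrome are hypothetical, the "Loading" marker comes from the question, and the use of headless Chrome is an assumption; any Selenium-supported browser should work similarly.

```python
import time


def wait_for_rendered_html(driver, url, marker="Loading", timeout=10.0, poll=0.5):
    """Load url with a browser driver and return the HTML once marker is gone.

    driver is any object with get(url) and a page_source attribute,
    for example a Selenium WebDriver.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker not in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} still present after {timeout} s")
        time.sleep(poll)


def fetch_with_chrome(url):
    """Assumes the selenium package and a Chrome installation are available."""
    # Imported here so the polling helper above has no third-party dependency.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        return wait_for_rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The returned string can be passed to BeautifulSoup in place of r.text. Polling for a marker is a simple heuristic; if you know which element you are waiting for, Selenium's own explicit waits are usually a better fit.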
