Question

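I'm trying to scrape a website and check it for specific changes. The site has a search bar, and I need to type something into it to reach the specific page I want to scrape. The problem is that the site is a single-page application: its URL doesn't change when the page refreshes and shows new results. I tried using requests, but since it relies on the URL it was of no use...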

Is there a method in requests, or another Python library, that gets around this problem and lets me move forward?

Answer 1

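My suggestion is to open the page with the developer console open. When you enter data, check what kind of requests the SPA sends (the XHR requests are the ones that interest you): the URL, the payload format, and so on. Then mimic the web page. Create a requests.Session object, get the page (probably not mandatory, but it doesn't hurt, so why not), then send the payload to the proper address and you will receive the data. It probably won't be HTML but rather JSON data, which is even better, because it's easier to work with later.

If you really need the HTML version, you can use Python bindings to libraries such as PhantomJS. You can use them to render the page and then check for the presence of specific elements. You can also use selenium, a library that lets you control a browser. You can even watch it work. It uses your existing browser, so it can handle any kind of web page, SPA or otherwise.

It all depends on your needs. If you are after pure data, I would go with the first solution; if you want to mimic a user, selenium is by far the easiest.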
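The first approach, replaying the SPA's own XHR call through a session and then checking whether the returned data changed, might look roughly like the sketch below. Everything site-specific here is an assumption: `PAGE_URL`, `SEARCH_URL`, and the `{"query": ...}` payload are hypothetical placeholders for whatever the developer console's network tab actually shows, and `fetch_results`, `fingerprint`, and `has_changed` are illustrative helper names, not part of requests.

```python
import hashlib
import json

# Hypothetical values: replace them with what the browser's network tab shows.
PAGE_URL = "https://example.com/"
SEARCH_URL = "https://example.com/api/search"


def fetch_results(session, query):
    """POST the search payload the way the SPA does and return the parsed JSON.

    `session` is expected to be a requests.Session (or anything with the same
    post() interface).
    """
    response = session.post(
        SEARCH_URL,
        json={"query": query},  # assumed payload shape
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def fingerprint(data):
    """Return a stable hash of JSON-serializable data, independent of key order."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, data):
    """Return True when the data no longer matches the fingerprint saved earlier."""
    return fingerprint(data) != previous_fingerprint


# Example use (needs the third-party requests package and a real endpoint):
#   import requests
#   with requests.Session() as session:
#       session.get(PAGE_URL)  # optional warm-up request, picks up cookies
#       data = fetch_results(session, "cheese")
#       print(fingerprint(data))
```

Saving the fingerprint between runs and comparing it with `has_changed` on the next run is one simple way to detect changes; hashing the sorted JSON avoids false alarms when the server merely reorders keys.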

Below is a selenium usage example, taken from their website.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# go to the google home page
driver.get("http://www.google.com")

# the page is ajaxy so the title is originally this:
print(driver.title)

# find the element whose name attribute is q (the google search box)
# (find_element_by_name was removed in Selenium 4; use find_element with By.NAME)
inputElement = driver.find_element(By.NAME, "q")

# type in the search
inputElement.send_keys("cheese!")

# submit the form (although google automatically searches now without submitting)
inputElement.submit()

try:
    # we have to wait for the page to refresh, the last thing that seems to be updated is the title
    WebDriverWait(driver, 10).until(EC.title_contains("cheese!"))

    # You should see "cheese! - Google Search"
    print(driver.title)

finally:
    # always close the browser, even if the wait timed out
    driver.quit()

Python: any way to web scrape and detect changes in a single-page application?
