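How can I programmatically get the content a web page loads only when you scroll down manually?

Time: 2018-06-28 06:47:24

Tags: python html css web-scraping beautifulsoup

I want to scrape some news links from this website. My code so far looks like this: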

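from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base = "https://www.philstar.com/business/"
page = requests.get(base)
soup = BeautifulSoup(page.text, "html.parser")

# "href" is an attribute, not a tag, so select the <a> tags that carry one
li_box = soup.find_all("a", href=True)

# the with-block closes the file even if a write fails
with open("News article links.txt", "w") as links:
    for a in li_box:
        # urljoin handles both relative and absolute href values
        links.write(urljoin(base, a["href"]) + "\n")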

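The problem is that it only finds the 15-16 links shown when the page first loads. If you scroll to the bottom of the page manually, more news items are loaded; scroll further and it loads more, and so on. My code cannot perform this "scroll down to see more" step. How can I scrape all of these news items (or, say, the first 1000)?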

2 answers:

Answer 0: (score: 2)

You need Selenium for this. I have modified your code a little; it should give you an idea of how to do it.

Try this:

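from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Path to the chromedriver binary; needed only if it is not found on PATH.
# Selenium 4 expects webdriver.Chrome(service=Service('--path--')) instead,
# with Service imported from selenium.webdriver.chrome.service.
browser = webdriver.Chrome('--path--')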

base = "https://www.philstar.com/business/"

browser.get(base)

# Auto-scroll the page until no more content is loaded
SCROLL_PAUSE_TIME = 0.5     # seconds; increase if new content loads slowly

# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html_source = browser.page_source
browser.quit()      # close the browser once the HTML has been captured
soup = BeautifulSoup(html_source, "html.parser")

li_box = soup.find_all('a')     # replace 'a' with whatever you want to find
print(li_box)
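
At this point `li_box` holds tag objects rather than URLs. Below is a minimal sketch of one way to turn their `href` values into a clean list, assuming you extract the values yourself first, for example with `[a["href"] for a in soup.find_all("a", href=True)]`. The helper name `collect_links` and its `limit` parameter are hypothetical, chosen here to match the "first 1000" in the question.

```python
from urllib.parse import urljoin


def collect_links(base, hrefs, limit=1000):
    """Resolve hrefs against base, drop duplicates, keep at most `limit` URLs."""
    seen = set()
    links = []
    for href in hrefs:
        if len(links) >= limit:
            break  # enough links collected
        url = urljoin(base, href.strip())
        # Skip non-web schemes such as javascript: or mailto:, and repeats.
        if not url.startswith(("http://", "https://")) or url in seen:
            continue
        seen.add(url)
        links.append(url)
    return links
```

Using `urljoin` rather than `base + href` matters because some links on a page are already absolute or start with `/`, and plain concatenation would produce broken URLs for those.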

Hope this helps! :) Thanks!

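Answer 1: (score: 0)

In this case I would consider using Selenium.

Selenium lets you scroll the page from code, which simulates a user scrolling through the web page. For some guidance, see the following: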

http://selenium-python.readthedocs.io/faq.html#how-to-scroll-down-to-the-bottom-of-a-page
http://blog.varunin.com/2011/08/scrolling-on-pages-using-selenium.html
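
The scrolling approach could be wrapped in a reusable function along these lines. This is only a sketch: it assumes `browser` is any object exposing Selenium's `execute_script` method, and the name `scroll_to_end` and the `max_rounds` cap are illustrative, not part of Selenium. The cap is there because a news feed may keep growing for a very long time, which matters if you only want roughly the first 1000 items.

```python
import time


def scroll_to_end(browser, pause=0.5, max_rounds=100, sleep=time.sleep):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scrolls that made the page grow.
    """
    get_height = "return document.body.scrollHeight"
    last_height = browser.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to fetch and render the next batch
        new_height = browser.execute_script(get_height)
        if new_height == last_height:
            break  # height unchanged; assume nothing more will load
        last_height = new_height
        loaded += 1
    return loaded
```

With a real driver you would call `scroll_to_end(browser)` right after `browser.get(base)` and then read `browser.page_source`. A fixed pause is a simple heuristic; if the site responds slowly, the loop can stop early, so a larger `pause` may be needed.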