I want to scrape some news links from this website. My code for that looks like this:
from bs4 import BeautifulSoup
import requests

base = "https://www.philstar.com/business/"
page = requests.get(base)
soup = BeautifulSoup(page.text, "html.parser")
# Note: find_all("href") matches nothing, since href is an attribute, not a tag.
# Search for <a> tags that carry an href attribute instead.
li_box = soup.find_all("a", href=True)
with open("News article links.txt", "w") as links:
    for a in li_box:
        links.write(base + a["href"] + "\n")
The problem is that it only finds the 15-16 links shown on the initial page. If you manually scroll down to the bottom of the page, you can see it loads more news content; scroll further and it loads more still, and so on. My code can't perform this "scroll down to see more" step. How can I scrape all of this news (or, say, the first 1000 articles)?
Answer 0 (score: 2)
You'll have to use Selenium for this. I've modified your code a bit so you can see how to do it.
Try this:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Pass the path to the driver here if Selenium can't find it on its own.
browser = webdriver.Chrome('--path--')

base = "https://www.philstar.com/business/"
browser.get(base)

# --- auto-scroll the page until no new content loads ---
SCROLL_PAUSE_TIME = 0.5

# Get the initial scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load new content
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate the new scroll height and compare it with the last one
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html_source = browser.page_source
soup = BeautifulSoup(html_source, "html.parser")
li_box = soup.find_all('a')  # here, whatever you want to find
print(li_box)
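To get from the list of `<a>` tags back to the links file the question asked for, you still need to filter out anchors without an `href` and non-article links. Here is a minimal sketch of that last step, run against a small hand-written HTML snippet standing in for `browser.page_source` (the real page's markup and URL paths will differ, so the `/business/` filter is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Sample markup standing in for browser.page_source (assumed structure,
# not the real philstar.com markup).
html_source = """
<div class="news"><a href="/business/2019/story-1">Story 1</a></div>
<div class="news"><a href="/business/2019/story-2">Story 2</a></div>
<a href="#top">Back to top</a>
"""

soup = BeautifulSoup(html_source, "html.parser")

# Keep only anchors that actually carry an href and look like article paths
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].startswith("/business/")]

with open("News article links.txt", "w") as f:
    for href in links:
        f.write("https://www.philstar.com" + href + "\n")

print(links)
```

Note that `find_all("a", href=True)` skips anchors with no `href` attribute at all, so the comprehension never raises a `KeyError`.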
Hope this helps! :) Thanks!
Answer 1 (score: 0)
In this case I would consider using Selenium.
With Selenium you can use a page-scrolling approach, which lets you simulate a user scrolling through the web page. For some guidance, see the following:
http://selenium-python.readthedocs.io/faq.html#how-to-scroll-down-to-the-bottom-of-a-page
http://blog.varunin.com/2011/08/scrolling-on-pages-using-selenium.html
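Whichever scrolling approach you use, the question asked for "the first 1000 news" items, and an infinite-scroll page tends to yield the same link more than once across scrapes. A small helper that caps the collection at the first N unique links (order-preserving) handles both concerns; the function name and the sample inputs below are illustrative, not from the original answers:

```python
def first_n_unique(hrefs, n=1000):
    """Return up to n unique hrefs, preserving first-seen order."""
    seen = set()
    out = []
    for h in hrefs:
        if h not in seen:
            seen.add(h)
            out.append(h)
            if len(out) == n:
                break
    return out

# Example: duplicates are dropped and the result is capped at n items.
print(first_n_unique(["/a", "/b", "/a", "/c"], n=2))  # → ['/a', '/b']
```

You could also use this as a stopping condition: keep scrolling only while `len(first_n_unique(collected))` is still below 1000, rather than scrolling to the very end of the page.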