I want to scrape some news links from this website. My code for that looks like this:
from bs4 import BeautifulSoup
import requests

base = "https://www.philstar.com/business/"
page = requests.get(base)
soup = BeautifulSoup(page.text, "html.parser")
# Note: find_all("href") matches nothing, since href is an attribute, not a tag.
# Search for <a> tags that carry an href attribute instead.
li_box = soup.find_all("a", href=True)
with open("News article links.txt", "w") as links:
    for a in li_box:
        links.write(base + a["href"] + "\n")
The problem is that it only finds the 15-16 links shown on the initial page. If you manually scroll down to the bottom of the page, you can see it loads more news content; scroll further and it loads more still, and so on. My code can't perform this "scroll down to see more" step. How can I scrape all of this news (or, say, the first 1000 articles)?
Answer 0 (score: 2)
You'll have to use Selenium for this. I've modified your code a bit so you can see how to do it.
Try this:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Pass the path to the driver here if Selenium can't find it on its own.
browser = webdriver.Chrome('--path--')

base = "https://www.philstar.com/business/"
browser.get(base)

# --- auto-scroll the page until no new content loads ---
SCROLL_PAUSE_TIME = 0.5

# Get the initial scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load new content
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate the new scroll height and compare it with the last one
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html_source = browser.page_source
soup = BeautifulSoup(html_source, "html.parser")
li_box = soup.find_all('a')  # here, whatever you want to find
print(li_box)
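To get from the list of `<a>` tags back to the links file the question asked for, you still need to filter out anchors without an `href` and non-article links. Here is a minimal sketch of that last step, run against a small hand-written HTML snippet standing in for `browser.page_source` (the real page's markup and URL paths will differ, so the `/business/` filter is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Sample markup standing in for browser.page_source (assumed structure,
# not the real philstar.com markup).
html_source = """
<div class="news"><a href="/business/2019/story-1">Story 1</a></div>
<div class="news"><a href="/business/2019/story-2">Story 2</a></div>
<a href="#top">Back to top</a>
"""

soup = BeautifulSoup(html_source, "html.parser")

# Keep only anchors that actually carry an href and look like article paths
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].startswith("/business/")]

with open("News article links.txt", "w") as f:
    for href in links:
        f.write("https://www.philstar.com" + href + "\n")

print(links)
```

Note that `find_all("a", href=True)` skips anchors with no `href` attribute at all, so the comprehension never raises a `KeyError`.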
Hope this helps! :) Thanks!
Answer 1 (score: 0)
In this case I would consider using Selenium.
With Selenium you can use a page-scrolling approach, which lets you simulate a user scrolling through the web page. For some guidance, see the following:
http://selenium-python.readthedocs.io/faq.html#how-to-scroll-down-to-the-bottom-of-a-page
http://blog.varunin.com/2011/08/scrolling-on-pages-using-selenium.html
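Whichever scrolling approach you use, the question asked for "the first 1000 news" items, and an infinite-scroll page tends to yield the same link more than once across scrapes. A small helper that caps the collection at the first N unique links (order-preserving) handles both concerns; the function name and the sample inputs below are illustrative, not from the original answers:

```python
def first_n_unique(hrefs, n=1000):
    """Return up to n unique hrefs, preserving first-seen order."""
    seen = set()
    out = []
    for h in hrefs:
        if h not in seen:
            seen.add(h)
            out.append(h)
            if len(out) == n:
                break
    return out

# Example: duplicates are dropped and the result is capped at n items.
print(first_n_unique(["/a", "/b", "/a", "/c"], n=2))  # → ['/a', '/b']
```

You could also use this as a stopping condition: keep scrolling only while `len(first_n_unique(collected))` is still below 1000, rather than scrolling to the very end of the page.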