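I want to build a web scraper in Scrapy that extracts 10,000 news links from https://hamariweb.com/news/newscategory.aspx?cat=7. The page is dynamic: more links load as I scroll down.
I tried it with Selenium, but it did not work.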
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy import signals
from scrapy.http import HtmlResponse


class WebnewsSpider(scrapy.Spider):
    name = 'webnews'
    allowed_domains = ['www.hamariweb.com']
    start_urls = ['https://hamariweb.com/news/newscategory.aspx?cat=7']

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        # options.add_argument('--blink-settings=imagesEnabled=false')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--incognito')
        self.driver = webdriver.Chrome("C://Users//hammad//Downloads//chromedriver", chrome_options=options)

    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        #start = datetime.datetime.now()
        # This loop has no exit condition, so the browser keeps scrolling forever.
        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)
            print("\n\n\nend\n\n\n")
            new_height = self.driver.execute_script("return document.body.scrollHeight")
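The code above opens the browser in incognito mode and keeps scrolling down. I also want to extract 10,000 news links and stop the browser once that limit is reached.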
Answer 0 (score: 0)
You can add logic to the parse() method that collects URLs by selecting the href attributes with a CSS selector:
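def parse(self, response):
    # Needs "import time" and "from scrapy.selector import Selector" at the top of the module.
    self.driver.get(response.url)
    pause_time = 1
    max_links = 10000
    urls = set()  # a set ignores links already collected on an earlier pass
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while len(urls) < max_links:
        # Parse the page as Selenium currently renders it. The Scrapy response
        # only holds the links from the first page load and never changes.
        rendered = Selector(text=self.driver.page_source)
        for href in rendered.css('a::attr(href)').getall():
            urls.add(response.urljoin(href))  # Follow the tutorial to learn how to use each href as you need
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
        time.sleep(pause_time)
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # The page stopped growing, so no more links will load
        last_height = new_height
    self.driver.quit()  # Stop the browser once the limit is reached or the page ends
    for url in list(urls)[:max_links]:
        yield {'url': url}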
The "following links" section of the Scrapy tutorial has a lot of information on how to handle links. You can use it to learn what else you can do with links in Scrapy.
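To make the link handling concrete, here is a minimal sketch that uses only the Python standard library, so it runs without Scrapy. It pulls every href out of an HTML string, resolves relative paths against the page URL, and drops duplicates. The names `LinkParser` and `collect_links` are hypothetical and chosen for illustration; inside a spider, `response.css('a::attr(href)').getall()` together with `response.urljoin()` should cover the same ground.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_links(html, base_url, seen=None):
    """Return absolute links from html that are not already in `seen`.

    `seen` is an optional set shared between calls, so links found on an
    earlier pass over a growing page are not returned again.
    """
    seen = set() if seen is None else seen
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # turn relative paths into full URLs
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Passing the same `seen` set on every call is one way to avoid counting a link twice when the page is re-read after each scroll.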
I have not tested this with infinite scrolling, so you may need to make some changes, but it should point you in the right direction.
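One possible shape for the stopping logic, written so it can be tried without a browser, is sketched below. It is an assumption about how the loop could be organized, not something tested against this site. The function name `scroll_until` and its parameters `extract_links`, `limit`, `pause` and `max_idle` are hypothetical. The `driver` argument only needs `execute_script()` and `page_source`, both of which Selenium WebDriver objects provide.

```python
import time


def scroll_until(driver, extract_links, limit=10000, pause=1.0, max_idle=3):
    """Scroll until `limit` unique links are found or the page stops growing.

    `extract_links` is a caller-supplied function that maps an HTML string
    to an iterable of URLs.
    """
    urls = []
    seen = set()
    idle = 0  # consecutive scrolls that did not change the page height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Re-read the rendered page on every pass, since it grows as we scroll.
        for url in extract_links(driver.page_source):
            if url not in seen:
                seen.add(url)
                urls.append(url)
        if len(urls) >= limit:
            break  # enough links collected
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        idle = idle + 1 if new_height == last_height else 0
        if idle >= max_idle:
            break  # height unchanged several times in a row: assume the end
        last_height = new_height
    return urls[:limit]
```

In the spider, this could be called as `scroll_until(self.driver, lambda html: Selector(text=html).css('a::attr(href)').getall())`, followed by `self.driver.quit()`. Allowing a few idle scrolls before giving up is a guess at how to tolerate slow loads; the right value of `max_idle` and `pause` depends on the site.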