Scraping URLs from a dynamic web page with Scrapy

Date: 2019-10-07 11:55:39

Tags: python selenium web-scraping scrapy scrapy-splash

I want to build a web scraper in Scrapy to extract 10,000 news links from this site: https://hamariweb.com/news/newscategory.aspx?cat=7. The page is dynamic: more links load as I scroll down.

I tried it with Selenium, but it did not work.

import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy import signals
from scrapy.http import HtmlResponse

class WebnewsSpider(scrapy.Spider):
    name = 'webnews'
    allowed_domains = ['www.hamariweb.com']
    start_urls = ['https://hamariweb.com/news/newscategory.aspx?cat=7']

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        # options.add_argument('--blink-settings=imagesEnabled=false')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--incognito')
        self.driver = webdriver.Chrome("C://Users//hammad//Downloads//chrome driver",
                                       chrome_options=options)

    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        # start = datetime.datetime.now()

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)
            print("\n\n\nend\n\n\n")
            new_height = self.driver.execute_script("return document.body.scrollHeight")
The above code opens the browser in incognito mode and keeps scrolling down. I also want to extract 10,000 news links and stop the browser once that limit is reached.

1 Answer:

Answer 0 (score: 0)

You can add the URL-collection logic to the parse() method by gathering the CSS hrefs:

def parse(self, response):
    self.driver.get(response.url)
    pause_time = 1
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    #start = datetime.datetime.now()
    urls = []
    while True:
        if len(urls) <= 10000:
            for href in response.css('a::attr(href)'):
                urls.append(href) # Follow tutorial to learn how to use the href object as you need
        else:
            break # Exit your while True statement when 10,000 links have been collected
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
        time.sleep(pause_time)
        print("\n\n\nend\n\n\n")
        new_height = self.driver.execute_script("return document.body.scrollHeight")
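
One thing to keep in mind: `response` in this snippet is the initial Scrapy response, so it will not contain the links that load as the browser scrolls. A variation (just a sketch, not tested against this site) is to rebuild an `HtmlResponse` from the rendered page on each pass and stop once the page height stops growing; the 10,000 cap and the `a::attr(href)` selector come from the question:

import time
from scrapy.http import HtmlResponse

def parse(self, response):
    self.driver.get(response.url)
    pause_time = 1
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    urls = set()  # a set avoids duplicates when the same markup is parsed again

    while len(urls) < 10000:
        # Re-parse the page as currently rendered by the browser.
        rendered = HtmlResponse(url=self.driver.current_url,
                                body=self.driver.page_source,
                                encoding='utf-8')
        urls.update(rendered.css('a::attr(href)').getall())

        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_time)

        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content is loading, the page is exhausted
        last_height = new_height

    self.driver.quit()  # close the browser once scrolling stops

The second break condition (unchanged scroll height) stops the loop even if the site runs out of articles before 10,000 links have been collected.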

The following links section of the Scrapy tutorial has a lot of information on how to handle links. You can use the information there to learn other things you can do with links in Scrapy.
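
For example, once `urls` has been filled in, the tutorial's pattern is to yield the collected links as items or turn them into new requests; a minimal sketch, where `parse_article` is a hypothetical callback for the article pages:

def parse(self, response):
    # ... scrolling and URL collection as above, ending with `urls` ...
    for href in urls:
        # Emit the absolute URL as a simple item...
        yield {'link': response.urljoin(href)}
        # ...or follow the link to scrape the article page itself:
        # yield response.follow(href, callback=self.parse_article)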

I haven't tested this with the infinite scrolling, so you may need to make some changes, but it should point you in the right direction.