How can I fetch a fixed number of Wikipedia URLs with Scrapy?

Asked: 2018-02-02 23:49:54

Tags: python scrapy

I am using this Python code to fetch 10,000 URLs from Wikipedia:

import scrapy


class WikipediaCrawler(scrapy.Spider):
    name = "wikipedia-crawler"
    start_urls = ['https://en.wikipedia.org/wiki/Special:Random']

    def start_requests(self):
        for page_counter in range(0, 10000):
            yield scrapy.Request(url=self.start_urls[0], callback=self.save_url)

        for page_counter in range(0, 10000):
            yield scrapy.Request(url=self.start_urls[page_counter], callback=self.parse)

    def parse(self, response):
        urls = []

        for link in response.css('a::attr(href)'):
            urls.append(link.extract())

        file_name = response.url.split("/")[-1] + '.html'
        file_name = file_name.replace(':', '_')

        with open('crawled/' + file_name, 'wb') as f:
            f.write(response.body)

        yield {
            str(response.url):
            {
                'ranking': 5,
                'links': urls
            }
        }

    def save_url(self, response):
        self.start_urls.append(response.url)

It does not work: it only ever processes a single page.

I am using the https://en.wikipedia.org/wiki/Special:Random URL to land on random Wikipedia pages and collect their URLs.

2 Answers:

Answer 0 (score: 1)

You are yielding the same start URL (url=self.start_urls[0]) 10,000 times. I think you should drop the start_urls variable and change start_requests to:

    def start_requests(self):
        for page_counter in range(0, 10000):
            yield scrapy.Request(url='https://en.wikipedia.org/wiki/Special:Random')

My understanding of the docs is that Scrapy consumes start_requests as a generator, rather than reading start_urls as a list.

So internally I guess it works something like:

starts = start_requests()
parse(next(starts))
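
To make that concrete, here is a toy sketch of that mental model (illustrative only, not Scrapy's real engine; `fetch` is a hypothetical stand-in for the downloader). The real engine additionally passes every request through the scheduler, which is where the duplicate filtering mentioned in the other answer happens:

# Toy model only: drain the start_requests() generator and dispatch callbacks.
def toy_engine(spider, fetch):
    # fetch is a hypothetical stand-in for Scrapy's downloader: Request -> Response
    for request in spider.start_requests():          # consumed lazily, one at a time
        response = fetch(request)
        callback = request.callback or spider.parse  # parse is the default callback
        yield from callback(response) or ()          # callbacks yield items/requests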

Answer 1 (score: 1)

The dont_filter=True parameter solved the problem:

https://doc.scrapy.org/en/latest/topics/request-response.html#request-objects


dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

Updated code:

import scrapy


class WikipediaCrawler(scrapy.Spider):
    name = "wikipedia-crawler"
    start_urls = ['https://en.wikipedia.org/wiki/Special:Random']

    def start_requests(self):

        # dont_filter=True keeps the scheduler's duplicate filter from dropping
        # the repeated requests to the same Special:Random URL.
        for page_counter in range(0, 10000):
            yield scrapy.Request(url=self.start_urls[0], callback=self.parse, dont_filter=True)

    def parse(self, response):
        urls = []

        for link in response.css('a::attr(href)'):
            urls.append(link.extract())

        file_name = response.url.split("/")[-1] + '.html'
        file_name = file_name.replace(':', '_')

        # Note: the crawled/ directory must already exist.
        with open('crawled/' + file_name, 'wb') as f:
            f.write(response.body)

        yield {
            str(response.url):
            {
                'ranking': 5,
                'links': urls
            }
        }
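
For completeness, one possible way to run this spider from a plain Python script, as a sketch under a couple of assumptions: the WikipediaCrawler class above is defined in (or importable into) the same file, and links.json is just an example output path. Also note that the crawled/ directory must exist before parse() tries to write into it:

import os
from scrapy.crawler import CrawlerProcess

os.makedirs('crawled', exist_ok=True)  # parse() writes downloaded pages here

process = CrawlerProcess(settings={
    'FEED_URI': 'links.json',    # export the yielded {url: {...}} items
    'FEED_FORMAT': 'json',
})
process.crawl(WikipediaCrawler)
process.start()  # blocks until all 10,000 requests have been handled

Running scrapy runspider your_spider_file.py -o links.json from the command line should achieve roughly the same thing.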