CrawlSpider doesn't seem to follow the rules

Time: 2015-12-17 17:37:01

Tags: python-2.7 web-scraping web-crawler scrapy scrapy-spider

Here is my code. I followed the example in "Recursively Scraping Web Pages With Scrapy", but I seem to have introduced a mistake somewhere.

Can someone please help me find it? It is driving me crazy. I want all the results from all of the result pages, but instead I only get the results from page 1.

Here is my code:

import scrapy

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from githubScrape.items import GithubscrapeItem


class GithubSpider(CrawlSpider):
    name = "github2"
    allowed_domains = ["github.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[contains(@class, "next_page")]')), callback='parse_items', follow=True),
    )

    def start_requests(self):

        baseURL = 'https://github.com/search?utf8=%E2%9C%93&q=eagle+SYSTEM+extension%3Asch+size%3A'
        for i in range(10000, 20000, +5000):
            url = baseURL+str(i+1)+".."+str(i+5000)+'&type=Code&ref=searchresults'
            print "URL:",url
            yield Request(url, callback=self.parse_items)


    def parse_items(self, response):

        hxs = Selector(response)
        resultParagraphs = hxs.xpath('//div[contains(@id,"code_search_results")]//p[contains(@class, "title")]')

        items = []
        for p in resultParagraphs:
            hrefs = p.xpath('a/@href').extract()
            projectURL = hrefs[0]
            schemeURL = hrefs[1]
            lastIndexedOn = p.xpath('.//span/time/@datetime').extract()

            i = GithubscrapeItem()
            i['counter'] = self.count
            i['projectURL'] = projectURL
            i['schemeURL'] = schemeURL
            i['lastIndexedOn'] = lastIndexedOn
            items.append(i)
        return(items)

1 answer:

Answer 0 (score: 1)

I couldn't find your code at the link you posted, but I think the problem is that you never use the rules.

Scrapy starts crawling by calling the start_requests method, but the rules are compiled and applied inside the parse method, which you never reach, because your requests go directly from start_requests to parse_items.
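To make the dispatch concrete, here is a toy model of how a response is routed to a callback. This is pure Python, not Scrapy's real internals, and all the class and method names below are invented for illustration only:

```python
# Toy model of CrawlSpider-style callback dispatch (illustrative only,
# not Scrapy's actual implementation).

class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class ToyCrawlSpider:
    def __init__(self):
        self.followed = []  # links the "rules" would extract and follow

    def parse(self, response):
        # The spider's default callback: this is where the rules run
        # and pagination links get followed.
        self.followed.append(response + "/page2")
        return "rules applied"

    def handle(self, request, response):
        # The engine routes each response to the request's explicit
        # callback, falling back to the default parse() if none is set.
        cb = request.callback or self.parse
        return cb(response)

spider = ToyCrawlSpider()

# With an explicit callback, parse() -- and therefore the rules -- never run:
r1 = Request("https://github.com/search", callback=lambda resp: "items only")
print(spider.handle(r1, "page1"))  # items only
print(spider.followed)             # []

# Without a callback, the default parse() runs and the rules fire:
r2 = Request("https://github.com/search")
print(spider.handle(r2, "page1"))  # rules applied
print(spider.followed)             # ['page1/page2']
```

This mirrors the situation in the question: because every request from start_requests carries `callback=self.parse_items`, the default parse (where the rules live) is bypassed, so the "next_page" links are never followed.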

If you want the rules to be applied to those responses, remove the callback from the requests you yield in start_requests.
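A minimal sketch of that fix, assuming the URL pattern from the question (`build_search_urls` is a helper name introduced here for illustration, not part of the asker's code):

```python
# Build the same GitHub code-search URLs as in the question, split into
# file-size buckets (10001..15000 and 15001..20000 bytes).
def build_search_urls():
    baseURL = 'https://github.com/search?utf8=%E2%9C%93&q=eagle+SYSTEM+extension%3Asch+size%3A'
    urls = []
    for i in range(10000, 20000, 5000):
        urls.append(baseURL + str(i + 1) + ".." + str(i + 5000)
                    + '&type=Code&ref=searchresults')
    return urls

# In the spider, yield the requests WITHOUT a callback so that
# CrawlSpider's default parse() receives the responses and applies
# the rules (and so follows the "next_page" links):
#
#     def start_requests(self):
#         for url in build_search_urls():
#             yield Request(url)   # no callback= here
#
print(build_search_urls()[0])
```

With the rules in charge of pagination, `parse_items` then only needs to stay as the Rule's `callback` for extracting items from each page.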