I'm new to Scrapy. I'm trying to crawl the page https://github.com/rg3/youtube-dl/pull/11272 and the commit links inside it (.../commits/...), but I've been stuck on this problem for hours.
I believe my code is correct, yet it always stops after a random number of pages (5-20), and I can't tell why, because there is no error message at all.
There is something odd in my log, though. It says:
no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
I still can't understand the duplicate requests, because my code never seems to request the same URL twice (see the commit_links = list(set(commit_links)) line in the code below).
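In case it helps with debugging, my understanding from the Scrapy docs is that DUPEFILTER_DEBUG can be switched on per spider via custom_settings, so every filtered request gets logged instead of only the first one. A minimal sketch:

class GitTest2(scrapy.Spider):
    name = 'test2'
    # Assumption: with the default dupefilter, this logs every dropped
    # duplicate request instead of collapsing them after the first one.
    custom_settings = {'DUPEFILTER_DEBUG': True}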
Summary of the code:
1. Extract all commit links and append them to commit_links.
2. Request the first link in commit_links, passing commit_links[1:] in its meta, and extract the data into that meta.
3. Repeat step 2 until commit_links == [].
4. yield the extracted data and store it.
Here is my code:
import re

import scrapy


class GitTest2(scrapy.Spider):
name = 'test2'
allowed_domains = ['github.com']
start_urls = ['https://github.com/rg3/youtube-dl/pull/11272']
def parse(self, response):
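        # Commit pages: scrape this commit's diff, then chain a request
        # to the next link carried in meta.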
if 'commits' in response.request.url:
comments = response.meta.get('comments', [])
commit_links = response.meta.get('commit_links', [])
print('**********************************************************************************************', len(commit_links))
if not commit_links:
yield comments
else:
master = 'https://github.com' + \
response.xpath("//a[@data-pjax='#js-repo-pjax-container']/@href").extract()[0]
date = response.xpath(
"//div[@class='commit full-commit prh-commit px-2 pt-2 ']//relative-time/@datetime").extract()[0]
for box in response.xpath("//div[contains(@class, 'file js-file js-details-container')]"):
before, after, code_lines, code_changes = [], [], [], []
for line in box.xpath(".//table//tr"):
if not line.xpath("./@class") and not line.xpath("./@data-position"):
num_b = line.xpath("./td[contains(@class, 'blob-num')][1]/@data-line-number").extract()
num_a = line.xpath("./td[contains(@class, 'blob-num')][2]/@data-line-number").extract()
code = re.sub(r'</?span.*?>|<br>', "",
line.xpath(".//span[@class='blob-code-inner']").extract()[0])
change = line.xpath(
".//td[contains(@class, 'blob-code-marker-cell')]/@data-code-marker").extract()[0]
before.append(int(num_b[0]) if num_b else -1)
after.append(int(num_a[0]) if num_a else -1)
code_lines.append(code)
code_changes.append(change)
elif not line.xpath("./@data-position") \
and line.xpath("./@class").extract()[0] == "js-expandable-line":
before += [-2] * 10
after += [-2] * 10
code_lines += [""] * 10
code_changes += [" "] * 10
comments['committed_files'].append({
'file': master + "/blob/master/" +
box.xpath(".//a[@class='link-gray-dark']/@title").extract()[0],
'date': date,
'before': before,
'after': after,
'code_lines': code_lines,
'code_changes': code_changes})
yield scrapy.Request(
url=commit_links[0], callback=self.parse,
meta={'commit_links': commit_links[1:], 'comments': comments})
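        # The pull request page itself: collect all commit links, dedupe
        # them, and start the chain with the first one.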
elif 'pull' in response.request.url:
print("#############################################################################################")
commit_links = []
for link in response.xpath("//a[@class='message']/@href").extract():
commit_links.append("https://github.com" + link)
commit_links = list(set(commit_links))
yield scrapy.Request(
url=commit_links[0], callback=self.parse,
meta={'commit_links': commit_links[1:], 'comments': {'committed_files': []}})
else:
print("====================================================_______________________===================")
Here is the log:
2019-03-08 23:54:21 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: github_crawling)
2019-03-08 23:54:21 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.6 (default, Sep 12 2018, 18:26:19) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
2019-03-08 23:54:21 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'github_crawling', 'DOWNLOAD_DELAY': 2.7, 'FEED_FORMAT': 'json', 'FEED_URI': 'test.json', 'NEWSPIDER_MODULE': 'github_crawling.spiders', 'SPIDER_MODULES': ['github_crawling.spiders'], 'USER_AGENT': 'github2_crawling (+http://www.yourdomain.com)'}
2019-03-08 23:54:21 [scrapy.extensions.telnet] INFO: Telnet Password: 094191885a579f60
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-08 23:54:21 [scrapy.core.engine] INFO: Spider opened
2019-03-08 23:54:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-08 23:54:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-08 23:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272> (referer: None)
#############################################################################################
2019-03-08 23:54:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/ba5a40054a7ff5adcfb9c2c0b3fb6489ffdf0c37> (referer: https://github.com/rg3/youtube-dl/pull/11272)
********************************************************************************************** 35
2019-03-08 23:54:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/80608898f59969004a63d0b49935a32b3a568b5c> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/ba5a40054a7ff5adcfb9c2c0b3fb6489ffdf0c37)
********************************************************************************************** 34
2019-03-08 23:54:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/cc895cd7125c6a2fe6f1855b0de43fded6abf173> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/80608898f59969004a63d0b49935a32b3a568b5c)
********************************************************************************************** 33
2019-03-08 23:54:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/105faafb48ba5a76fffbec5780ea2222fc2876a2> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/cc895cd7125c6a2fe6f1855b0de43fded6abf173)
********************************************************************************************** 32
2019-03-08 23:54:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/8842f08df3daaac213124ae0c259ae12f3919f60> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/105faafb48ba5a76fffbec5780ea2222fc2876a2)
********************************************************************************************** 31
2019-03-08 23:54:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/2ce996c688e7ca20feb842d898ca38a98af51e1a> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/8842f08df3daaac213124ae0c259ae12f3919f60)
********************************************************************************************** 30
2019-03-08 23:54:47 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://github.com/rg3/youtube-dl/pull/11272/commits/d328b8c6c2a1c984396fedc7ab2b141a11ccbee6> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-03-08 23:54:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/d328b8c6c2a1c984396fedc7ab2b141a11ccbee6> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/2ce996c688e7ca20feb842d898ca38a98af51e1a)
********************************************************************************************** 29
2019-03-08 23:54:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/70d9194719f8f2d0b68aba69d73f13cfa6ab8a2c> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/d328b8c6c2a1c984396fedc7ab2b141a11ccbee6)
********************************************************************************************** 28
2019-03-08 23:54:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/a89d4906e72fd28b0373df00045a03da00838075> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/70d9194719f8f2d0b68aba69d73f13cfa6ab8a2c)
********************************************************************************************** 27
2019-03-08 23:54:55 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-08 23:54:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 9228,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 444040,
'downloader/response_count': 10,
'downloader/response_status_count/200': 10,
'dupefilter/filtered': 8,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 8, 20, 24, 55, 528870),
'log_count/DEBUG': 11,
'log_count/INFO': 9,
'memusage/max': 52969472,
'memusage/startup': 52969472,
'request_depth_max': 9,
'response_received_count': 10,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2019, 3, 8, 20, 24, 21, 463924)}
2019-03-08 23:54:55 [scrapy.core.engine] INFO: Spider closed (finished)
Where am I going wrong, and how can I fix this? Any help would be greatly appreciated.
Important note: set DOWNLOAD_DELAY > 2.5 so that GitHub does not block your IP.
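For reference, this is the relevant line from my settings.py (it also shows up under "Overridden settings" in the log above):

# Stay well below GitHub's rate limit; anything above 2.5 seconds worked for me.
DOWNLOAD_DELAY = 2.7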