I'm new to Scrapy. I'm trying to crawl the page https://github.com/rg3/youtube-dl/pull/11272 and the commit links inside it (.../commits/...), but I've been stuck on this problem for hours.
I believe my code is correct, yet it always stops after a random number of pages (5-20), and I can't tell why, because there is no error message at all.
There is something odd in my log, though. It says:
no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
I still can't understand the duplicate requests, because my code never seems to request the same URL twice (see the commit_links = list(set(commit_links)) line in the code below).
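In case it helps with debugging, my understanding from the Scrapy docs is that DUPEFILTER_DEBUG can be switched on per spider via custom_settings, so every filtered request gets logged instead of only the first one. A minimal sketch:

class GitTest2(scrapy.Spider):
    name = 'test2'
    # Assumption: with the default dupefilter, this logs every dropped
    # duplicate request instead of collapsing them after the first one.
    custom_settings = {'DUPEFILTER_DEBUG': True}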
Summary of the code:
1. Extract all commit links and append them to commit_links.
2. Request the first link in commit_links, passing commit_links[1:] in its meta, and extract the data into that meta.
3. Repeat step 2 until commit_links == [].
4. yield the extracted data and store it.
Here is my code:
import re

import scrapy


class GitTest2(scrapy.Spider):
name = 'test2'
allowed_domains = ['github.com']
start_urls = ['https://github.com/rg3/youtube-dl/pull/11272']
def parse(self, response):
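        # Commit pages: scrape this commit's diff, then chain a request
        # to the next link carried in meta.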
if 'commits' in response.request.url:
comments = response.meta.get('comments', [])
commit_links = response.meta.get('commit_links', [])
print('**********************************************************************************************', len(commit_links))
if not commit_links:
yield comments
else:
master = 'https://github.com' + \
response.xpath("//a[@data-pjax='#js-repo-pjax-container']/@href").extract()[0]
date = response.xpath(
"//div[@class='commit full-commit prh-commit px-2 pt-2 ']//relative-time/@datetime").extract()[0]
for box in response.xpath("//div[contains(@class, 'file js-file js-details-container')]"):
before, after, code_lines, code_changes = [], [], [], []
for line in box.xpath(".//table//tr"):
if not line.xpath("./@class") and not line.xpath("./@data-position"):
num_b = line.xpath("./td[contains(@class, 'blob-num')][1]/@data-line-number").extract()
num_a = line.xpath("./td[contains(@class, 'blob-num')][2]/@data-line-number").extract()
code = re.sub(r'</?span.*?>|<br>', "",
line.xpath(".//span[@class='blob-code-inner']").extract()[0])
change = line.xpath(
".//td[contains(@class, 'blob-code-marker-cell')]/@data-code-marker").extract()[0]
before.append(int(num_b[0]) if num_b else -1)
after.append(int(num_a[0]) if num_a else -1)
code_lines.append(code)
code_changes.append(change)
elif not line.xpath("./@data-position") \
and line.xpath("./@class").extract()[0] == "js-expandable-line":
before += [-2] * 10
after += [-2] * 10
code_lines += [""] * 10
code_changes += [" "] * 10
comments['committed_files'].append({
'file': master + "/blob/master/" +
box.xpath(".//a[@class='link-gray-dark']/@title").extract()[0],
'date': date,
'before': before,
'after': after,
'code_lines': code_lines,
'code_changes': code_changes})
yield scrapy.Request(
url=commit_links[0], callback=self.parse,
meta={'commit_links': commit_links[1:], 'comments': comments})
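        # The pull request page itself: collect all commit links, dedupe
        # them, and start the chain with the first one.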
elif 'pull' in response.request.url:
print("#############################################################################################")
commit_links = []
for link in response.xpath("//a[@class='message']/@href").extract():
commit_links.append("https://github.com" + link)
commit_links = list(set(commit_links))
yield scrapy.Request(
url=commit_links[0], callback=self.parse,
meta={'commit_links': commit_links[1:], 'comments': {'committed_files': []}})
else:
print("====================================================_______________________===================")
Here is the log:
2019-03-08 23:54:21 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: github_crawling)
2019-03-08 23:54:21 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.6 (default, Sep 12 2018, 18:26:19) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
2019-03-08 23:54:21 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'github_crawling', 'DOWNLOAD_DELAY': 2.7, 'FEED_FORMAT': 'json', 'FEED_URI': 'test.json', 'NEWSPIDER_MODULE': 'github_crawling.spiders', 'SPIDER_MODULES': ['github_crawling.spiders'], 'USER_AGENT': 'github2_crawling (+http://www.yourdomain.com)'}
2019-03-08 23:54:21 [scrapy.extensions.telnet] INFO: Telnet Password: 094191885a579f60
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-08 23:54:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-08 23:54:21 [scrapy.core.engine] INFO: Spider opened
2019-03-08 23:54:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-08 23:54:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-08 23:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272> (referer: None)
#############################################################################################
2019-03-08 23:54:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/ba5a40054a7ff5adcfb9c2c0b3fb6489ffdf0c37> (referer: https://github.com/rg3/youtube-dl/pull/11272)
********************************************************************************************** 35
2019-03-08 23:54:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/80608898f59969004a63d0b49935a32b3a568b5c> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/ba5a40054a7ff5adcfb9c2c0b3fb6489ffdf0c37)
********************************************************************************************** 34
2019-03-08 23:54:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/cc895cd7125c6a2fe6f1855b0de43fded6abf173> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/80608898f59969004a63d0b49935a32b3a568b5c)
********************************************************************************************** 33
2019-03-08 23:54:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/105faafb48ba5a76fffbec5780ea2222fc2876a2> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/cc895cd7125c6a2fe6f1855b0de43fded6abf173)
********************************************************************************************** 32
2019-03-08 23:54:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/8842f08df3daaac213124ae0c259ae12f3919f60> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/105faafb48ba5a76fffbec5780ea2222fc2876a2)
********************************************************************************************** 31
2019-03-08 23:54:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/2ce996c688e7ca20feb842d898ca38a98af51e1a> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/8842f08df3daaac213124ae0c259ae12f3919f60)
********************************************************************************************** 30
2019-03-08 23:54:47 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://github.com/rg3/youtube-dl/pull/11272/commits/d328b8c6c2a1c984396fedc7ab2b141a11ccbee6> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-03-08 23:54:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/d328b8c6c2a1c984396fedc7ab2b141a11ccbee6> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/2ce996c688e7ca20feb842d898ca38a98af51e1a)
********************************************************************************************** 29
2019-03-08 23:54:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/70d9194719f8f2d0b68aba69d73f13cfa6ab8a2c> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/d328b8c6c2a1c984396fedc7ab2b141a11ccbee6)
********************************************************************************************** 28
2019-03-08 23:54:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/rg3/youtube-dl/pull/11272/commits/a89d4906e72fd28b0373df00045a03da00838075> (referer: https://github.com/rg3/youtube-dl/pull/11272/commits/70d9194719f8f2d0b68aba69d73f13cfa6ab8a2c)
********************************************************************************************** 27
2019-03-08 23:54:55 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-08 23:54:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 9228,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 444040,
'downloader/response_count': 10,
'downloader/response_status_count/200': 10,
'dupefilter/filtered': 8,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 8, 20, 24, 55, 528870),
'log_count/DEBUG': 11,
'log_count/INFO': 9,
'memusage/max': 52969472,
'memusage/startup': 52969472,
'request_depth_max': 9,
'response_received_count': 10,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2019, 3, 8, 20, 24, 21, 463924)}
2019-03-08 23:54:55 [scrapy.core.engine] INFO: Spider closed (finished)
Where am I going wrong, and how can I fix this? Any help would be greatly appreciated.
Important note: set DOWNLOAD_DELAY > 2.5 so that GitHub does not block your IP.
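For reference, this is the relevant line from my settings.py (it also shows up under "Overridden settings" in the log above):

# Stay well below GitHub's rate limit; anything above 2.5 seconds worked for me.
DOWNLOAD_DELAY = 2.7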