I've been writing this web scraper and I can't figure out why it stops early. Here's the code:
import scrapy, MySQLdb, urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy import Request

class MyItems(scrapy.Item):
    topLinks = scrapy.Field()
    artists = scrapy.Field()

class mp3Spider(CrawlSpider):
    name = 'mp3_scraper'
    allowed_domains = [
        'example.com'
    ]
    start_urls = [
        'http://www.example.com'
    ]

    def __init__(self, *a, **kw):
        super(mp3Spider, self).__init__(*a, **kw)
        self.item = MyItems()

    def parse(self, response):
        f = open('topLinks', 'w')
        self.item['topLinks'] = response.xpath("//div[contains(@class, 'en')]/a[contains(@class, 'hash')]/@href").extract()
        for x in range(len(self.item['topLinks'])):
            self.item['topLinks'][x] = 'http://www.example.com' + self.item['topLinks'][x]
        for x in range(len(self.item['topLinks'])):
            f.write(format(self.item['topLinks'][x]).encode('utf-8') + '\n')
            yield Request(url=self.item['topLinks'][x], callback=self.parse_artists)

    def parse_artists(self, response):
        f = open('artists', 'w')
        self.item['artists'] = response.xpath("//ul[contains(@class, 'artist_list')]/li/a/text()").extract()
        for x in range(len(self.item['artists'])):
            f.write(format(self.item['artists'][x]).encode('utf-8') + '\n')
So both parse functions grab the information I need, but parse_artists only seems to process one link. The parse function collects all the links I need — I know it does, because I print them to a file and can see them all there. So say it grabs the links example.com/artists/a, example.com/artists/b, etc. — parse_artists only scrapes example.com/artists/a and then stops. Any help would be appreciated, thanks. -SAM
EDIT: output log -
C:\Python27\python.exe C:/Users/sam/PycharmProjects/mp3_scraper/mp3_scraper/mp3_scraper/main.py
2014-09-13 12:28:24-0400 [scrapy] INFO: Scrapy 0.24.2 started (bot: mp3_scraper)
2014-09-13 12:28:24-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-09-13 12:28:24-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mp3_scraper.spiders', 'SPIDER_MODULES': ['mp3_scraper.spiders'], 'BOT_NAME': 'mp3_scraper'}
2014-09-13 12:28:24-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled item pipelines:
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Spider opened
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/> (referer: None)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/z/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/0..9/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/w/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/x/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/u/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/q/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/v/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/y/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/t/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/o/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/p/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/r/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/n/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/s/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/l/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/h/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/k/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/i/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/g/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/m/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/j/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/f/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/e/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/c/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/d/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/b/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Closing spider (finished)
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 10106,
'downloader/request_count': 27,
'downloader/request_method_count/GET': 27,
'downloader/response_bytes': 887850,
'downloader/response_count': 27,
'downloader/response_status_count/200': 27,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 9, 13, 16, 28, 28, 908000),
'log_count/DEBUG': 29,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 27,
'scheduler/dequeued': 27,
'scheduler/dequeued/memory': 27,
'scheduler/enqueued': 27,
'scheduler/enqueued/memory': 27,
'start_time': datetime.datetime(2014, 9, 13, 16, 28, 25, 315000)}
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Spider closed (finished)
Process finished with exit code 0
Answer 0 (score: 0)
You open the artists file in mode 'w', which truncates the file if it already exists. parse_artists is called once per crawled page, so each call wipes out what the previous call wrote — after the spider finishes, only the last scraped page's artists remain in the file.

You should open the file for appending (mode 'a') to fix the problem:
    def parse_artists(self, response):
        f = open('artists', 'a')
        ...
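A standalone sketch of the difference (plain Python, no Scrapy required; the file names here are just illustrative, not the ones the spider uses): mode 'w' truncates on every open, so repeated calls keep only the last write, while mode 'a' accumulates them.

```python
import os

# Start from a clean slate so repeated runs behave the same.
for path in ('artists_w.txt', 'artists_a.txt'):
    if os.path.exists(path):
        os.remove(path)

def write_batch(path, lines, mode):
    # Mimics one parse_artists call: open, write one page's artists, close.
    with open(path, mode) as f:
        for line in lines:
            f.write(line + '\n')

# Simulate parse_artists being invoked once per crawled page.
for batch in (['artist_a'], ['artist_b'], ['artist_c']):
    write_batch('artists_w.txt', batch, 'w')  # 'w' truncates on every open
    write_batch('artists_a.txt', batch, 'a')  # 'a' appends on every open

print(open('artists_w.txt').read().split())  # only the last batch survives
print(open('artists_a.txt').read().split())  # all three batches survive
```

Note that appending is a minimal fix; in a real Scrapy project the more idiomatic route would be to yield items and let an item pipeline or feed export write the output, which also avoids leaving file handles open in the callbacks.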