Scraper finishes after parsing 1 link

Time: 2014-09-13 16:15:33

Tags: python web-scraping

I've been writing this web scraper, and I can't figure out why it stops early. Here is the code:

import scrapy, MySQLdb, urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy import Request


class MyItems(scrapy.Item):
    topLinks = scrapy.Field()
    artists = scrapy.Field()

class mp3Spider(CrawlSpider):
    name = 'mp3_scraper'
    allowed_domains = [
        'example.com'
    ]
    start_urls = [
        'http://www.example.com'
    ]

    def __init__(self, *a, **kw):
        super(mp3Spider, self).__init__(*a, **kw)

        self.item = MyItems()

    def parse(self, response):
        f = open('topLinks', 'w')
        self.item['topLinks'] = response.xpath("//div[contains(@class, 'en')]/a[contains(@class, 'hash')]/@href").extract()

        for x in range(len(self.item['topLinks'])):
            self.item['topLinks'][x] = 'http://www.example.com' + self.item['topLinks'][x]

        for x in range(len(self.item['topLinks'])):
            f.write(format(self.item['topLinks'][x]).encode('utf-8')+ '\n')
            yield Request(url=self.item['topLinks'][x], callback=self.parse_artists)

    def parse_artists(self, response):
        f = open('artists', 'w')
        self.item['artists'] = response.xpath("//ul[contains(@class, 'artist_list')]/li/a/text()").extract()

        for x in range(len(self.item['artists'])):
            f.write(format(self.item['artists'][x]).encode('utf-8') + '\n')

So both parse functions grab the information I need, but parse_artists only seems to parse 1 link. The parse function scrapes all the links I need, and I can see that it does because I print them to a file. So say it grabs the links example.com/artists/a, example.com/artists/b, etc. — parse_artists only scrapes example.com/artists/a and then stops. Any help would be appreciated, thanks. -SAM

Edit: output log -

C:\Python27\python.exe C:/Users/sam/PycharmProjects/mp3_scraper/mp3_scraper/mp3_scraper/main.py
2014-09-13 12:28:24-0400 [scrapy] INFO: Scrapy 0.24.2 started (bot: mp3_scraper)
2014-09-13 12:28:24-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-09-13 12:28:24-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mp3_scraper.spiders', 'SPIDER_MODULES': ['mp3_scraper.spiders'], 'BOT_NAME': 'mp3_scraper'}
2014-09-13 12:28:24-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled item pipelines: 
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Spider opened
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/> (referer: None)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/z/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/0..9/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/w/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/x/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/u/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/q/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/v/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/y/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/t/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/o/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/p/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/r/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/n/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/s/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/l/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/h/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/k/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/i/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/g/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/m/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/j/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/f/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/e/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/c/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/d/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/b/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Closing spider (finished)
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 10106,
     'downloader/request_count': 27,
     'downloader/request_method_count/GET': 27,
     'downloader/response_bytes': 887850,
     'downloader/response_count': 27,
     'downloader/response_status_count/200': 27,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 9, 13, 16, 28, 28, 908000),
     'log_count/DEBUG': 29,
     'log_count/INFO': 7,
     'request_depth_max': 1,
     'response_received_count': 27,
     'scheduler/dequeued': 27,
     'scheduler/dequeued/memory': 27,
     'scheduler/enqueued': 27,
     'scheduler/enqueued/memory': 27,
     'start_time': datetime.datetime(2014, 9, 13, 16, 28, 25, 315000)}
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Spider closed (finished)

Process finished with exit code 0

1 Answer:

Answer 0 (score: 0)

You open the artists file in w mode, which truncates the file if it already exists. Every call to parse_artists therefore wipes out what the previous call wrote, so after the spider finishes, only the last crawled page's artists remain in the file. (Note that the spider did crawl all 27 pages — the log shows a 200 response for every letter — it just overwrote the output each time.)

You should open the file for appending (mode a) to fix the problem:

def parse_artists(self, response):
    f = open('artists', 'a')
    ...
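This standalone snippet (no Scrapy required) illustrates the difference. It simulates several callbacks each opening the same file, first with mode 'w' and then with mode 'a'; the file name and artist names are made up for the demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "artists")

# Mode 'w' truncates on every open: each "callback" wipes the previous one's output.
for name in ["artist_a", "artist_b", "artist_c"]:
    f = open(path, "w")
    f.write(name + "\n")
    f.close()

print(open(path).read().splitlines())  # ['artist_c'] — only the last write survives

# Mode 'a' appends, so every "callback"'s output is preserved.
for name in ["artist_a", "artist_b", "artist_c"]:
    f = open(path, "a")
    f.write(name + "\n")
    f.close()

print(open(path).read().splitlines())  # ['artist_c', 'artist_a', 'artist_b', 'artist_c']
```

Also note the original code never calls `f.close()`, so writes may sit in a buffer; using `with open('artists', 'a') as f:` in the callback handles both the append mode and the close automatically.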