Python Scrapy doesn't seem to fetch data from all available URLs

Asked: 2011-11-26 06:55:33

Tags: python screen-scraping web-scraping scrapy

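I am trying to scrape thesession.org to build a table listing the number of times each tune has been added to members' tunebooks, so that I can find some popular tunes to learn. I started from the Scrapy tutorial and tried to modify it to suit my purposes. The problem is that although thesession.org appears to have about 10,390 tunes, my scraper only returns data for 10 of them (only the ones listed on http://www.thesession.org/tunes/index.php). How can I get data for all the tunes (or for the top-ranked tunes)? Any suggestions would be appreciated.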

Here is what I have so far:

items.py

from scrapy.item import Item, Field

class tuneItem(Item):
    url = Field()
    name1 = Field()
    name2 = Field()
    key = Field()
    count = Field()

tune_spider.py

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings

class tunesSpider(CrawlSpider):

    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    rules = [Rule(SgmlLinkExtractor(allow=['/display/\d+'], deny=['/members/','/recordings/','/index/','/display/\d+/.']), 'parse_tune')]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)

        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key']   = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re('\d+')
        return tune

I run the scraper by opening a console, changing to the directory that contains the tutorial's cfg file, and running scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv

This is what I get:

C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://ww
w.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
        {'count': [u'1'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Brendan Begley's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
        {'count': [u'3'],
         'key': [u'Key signature: Amajor'],
         'name1': [u'Carleton County Breakdown'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
        {'count': [u'3'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Kasper's Rant"],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
        {'count': [u'5'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'The Full Of The Bag'],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
        {'count': [u'1'],
         'key': [u'Key signature: Adorian'],
         'name1': [u'The New Steamboat'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
        {'count': [u'4'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u"Galen's Arrival"],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
        {'count': [u'2'],
         'key': [u'Key signature: Amixolydian'],
         'name1': [u'Culloden Day'],
         'name2': [u'strathspey'],
         'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
        {'count': [u'2'],
         'key': [u'Key signature: Aminor'],
         'name1': [u'Miss Sine Flemington'],
         'name2': [u'barndance'],
         'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
        {'count': [u'2'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Joan Martin's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
        {'count': [u'2'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'My Time Inside 2005'],
         'name2': [u'waltz'],
         'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
        {'downloader/request_bytes': 3655,
         'downloader/request_count': 12,
         'downloader/request_method_count/GET': 12,
         'downloader/response_bytes': 31620,
         'downloader/response_count': 12,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
         'item_scraped_count': 10,
         'request_depth_max': 1,
         'scheduler/memory_enqueued': 12,
         'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
        {}

Edit: The answer from @reclosedev got me on my way. For anyone curious about the results, here is a snapshot...

(1) The vast majority of tunes are in fewer than 10 members' tunebooks

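(2) The popularity of all 10,379 tunes I was able to get from the site (measured by the number of tunebooks each one is in) follows a power-law distribution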

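(3) Here are the tunes on the site that are in roughly 1,000 tunebooks, showing the names of the top-ranked tunes and the number of tunebooks each one is in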

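A ranking like the one described in (3) could be produced from the exported CSV roughly as follows. This is only a minimal sketch: the sample rows are taken from the crawl log above, the column names follow the item fields, and the exact layout of scraped_data.csv is an assumption. The helper rank_tunes is hypothetical.

import csv
import io

# Sample rows copied from the crawl log in the question. The real
# scraped_data.csv may order or quote its columns differently (assumption).
SAMPLE_CSV = """name1,name2,count
Brendan Begley's,polka,1
Carleton County Breakdown,reel,3
The Full Of The Bag,hornpipe,5
Galen's Arrival,reel,4
"""


def rank_tunes(csv_text):
    """Return (name, count) pairs sorted by tunebook count, highest first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    tunes = [(row["name1"], int(row["count"])) for row in rows]
    return sorted(tunes, key=lambda tune: tune[1], reverse=True)


ranking = rank_tunes(SAMPLE_CSV)
for name, count in ranking:
    print(name, count)

With the full crawl, the same function could be fed the contents of the exported file instead of the sample string.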

2 answers:

Answer 0 (score: 5)

You need to add a Rule that extracts the links to all of the index pages, which the spider will then follow:

rules = [
    ..., #your existing parse_tune rule
    Rule(
        SgmlLinkExtractor(
             allow=('/index/new\?new_start=\d+',)
        ),
        follow=True,
    ),
]

Edit:

follow=True is not required here, because follow defaults to True when callback is None.
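To see why the original rule alone stops at ten tunes, it can help to check which URLs each pattern accepts. The sketch below only approximates the link extractor's allow/deny filtering with the standard re module (the assumption is that a URL passes when some allow pattern is found in it and no deny pattern is). The helper matches is hypothetical, and the two URLs are taken from this thread.

import re

# Patterns copied from the question's rule and from the rule added above.
TUNE_ALLOW = [r'/display/\d+']
TUNE_DENY = [r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']
PAGE_ALLOW = [r'/index/new\?new_start=\d+']


def matches(url, allow, deny=()):
    """Approximate the filter: some allow pattern is found, no deny pattern is."""
    allowed = any(re.search(pattern, url) for pattern in allow)
    denied = any(re.search(pattern, url) for pattern in deny)
    return allowed and not denied


tune_url = "http://www.thesession.org/tunes/display/11602"
page_url = "http://www.thesession.org/tunes/index/new?new_start=10"
urls = [tune_url, page_url]

# URLs the original tune rule would pick up.
tune_rule_hits = [u for u in urls if matches(u, TUNE_ALLOW, TUNE_DENY)]
# URLs the added pagination rule would pick up.
page_rule_hits = [u for u in urls if matches(u, PAGE_ALLOW)]

print(tune_rule_hits)
print(page_rule_hits)

The tune rule never accepts a pagination URL, so without the second rule the spider only sees the tune links on the start page.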

Answer 1 (score: 0)

There are many possible ways to do this; let me suggest the simplest one:

Run the code ten times, replacing start_urls each time, or build the URLs in a loop over range(10, 100, 10):

http://www.thesession.org/tunes/index/new?new_start=10
http://www.thesession.org/tunes/index/new?new_start=20
http://www.thesession.org/tunes/index/new?new_start=30
http://www.thesession.org/tunes/index/new?new_start=40
http://www.thesession.org/tunes/index/new?new_start=50
http://www.thesession.org/tunes/index/new?new_start=60
http://www.thesession.org/tunes/index/new?new_start=70
http://www.thesession.org/tunes/index/new?new_start=80
http://www.thesession.org/tunes/index/new?new_start=90
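The loop suggested above might look like the following minimal sketch. The name PAGE_URL is hypothetical, and the upper bound of 100 is simply the one from this answer. Covering every tune would need a bound near the site's total tune count, which is an assumption about how the pagination behaves.

# URL template for the paginated index; %d is the new_start offset.
PAGE_URL = "http://www.thesession.org/tunes/index/new?new_start=%d"

# The front page lists the first ten tunes; later pages start at 10, 20, ...
start_urls = ["http://www.thesession.org/tunes"]
start_urls += [PAGE_URL % offset for offset in range(10, 100, 10)]

for url in start_urls:
    print(url)

The resulting list can be assigned to the spider's start_urls attribute in place of the single URL from the question.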