Crawling extratorrent.cc with Scrapy

Time: 2015-04-12 19:15:47

Tags: python xpath web-crawler scrapy

I am trying to crawl www.extratorrent.cc using Scrapy. Here is my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from extra.items import *

class extraSpider(CrawlSpider):

    name = 'extraSpider'
    allowed_domains = ['extratorrent.cc']
    start_urls = ['http://www.extratorrent.cc/torrent']
    rules = [Rule(LinkExtractor(allow=['/\d+/\S+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[2]/table[2]/tbody/tr/td[2]/h1").extract()
        torrent['description'] = response.xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[2]/div[4]").extract()
        torrent['size'] = response.xpath("/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td[1]/table/tbody/tr[10]/td[2]").extract()
        return torrent

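In the generated JSON file I only get the url field; the other fields (description, size and name) are missing.

I can't figure out where I went wrong. I tried changing the XPath expressions, but to no avail. I must be missing something very small.

1 answer:

Answer 0: (score: 0)

I made a few changes to your code; this might help: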

# Note: in Scrapy 1.0 and later these modules live at
# scrapy.spiders and scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from extra.items import TorrentItem


class extraSpider(CrawlSpider):

    name = 'extraSpider'
    allowed_domains = ['extratorrent.cc']
    start_urls = ['http://www.extratorrent.cc/torrent']
    # Follow links whose path contains /<digits>/<something> and parse them.
    rules = [Rule(LinkExtractor(allow=[r'/\d+/\S+']), 'parse_torrent')]

    def parse_torrent(self, response):
        url = response.url
        # Relative XPaths keyed on tags and classes, not on the page layout.
        name = response.xpath(
            "//h1/b/text()").extract()
        name = name[0].strip() if name else 'N/A'
        # Collect every text node of the description and collapse whitespace.
        description = response.xpath(
            '//div[@class="borderdark"]//text()').extract()
        description = ' '.join(
            ' '.join(description).split()) if description else 'N/A'
        # The size is the first cell that follows the "Total Size:" label.
        size = response.xpath(
            '//td[@class="tabledata1" and contains(text(), "Total Size:")]/following-sibling::td[@class="tabledata0" and position()=1]/text()').extract()
        # Replace non-breaking spaces with ordinary spaces.
        size = size[0].strip().replace(u'\xa0', ' ') if size else 'N/A'
        torrent = TorrentItem(
            url=url,
            name=name,
            description=description,
            size=size)
        yield torrent
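
One likely reason the original absolute XPaths matched nothing: paths copied from a browser's inspector usually contain tbody steps, because browsers insert <tbody> into the DOM even when the raw HTML that Scrapy downloads has none. I have not checked this against the live page, so treat it as an assumption. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, on a made-up fragment (the markup is hypothetical), to show how one missing tbody step empties the whole result while a short relative path still matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment resembling raw server HTML:
# note that the table has no <tbody> element.
RAW_HTML = """
<html>
  <body>
    <table>
      <tr>
        <td><h1><b> Some torrent name </b></h1></td>
      </tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(RAW_HTML)

# Path as a browser inspector would report it, including the inserted tbody.
absolute_matches = root.findall("./body/table/tbody/tr/td/h1")

# Short relative path that does not depend on the table layout.
relative_matches = root.findall(".//h1/b")
names = [b.text.strip() for b in relative_matches]

print(len(absolute_matches))
print(names)
```

The same idea carries over to response.xpath(): an expression such as //h1/b/text() keeps working when wrapper tables are added or removed, whereas a full path from /html breaks on any structural difference.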

Here is some sample output:

{'description': u'1 CD DVDRip - x264 AAC Stereo Audio English & Arabic Subtitles Specs: Format : Matroska File size : 701 MiB Duration : 2h 9mn Overall bit rate : 756 Kbps Encoded date : UTC 2015-04-14 21:26:04 Video Format : AVC Format/Info : Advanced Video Codec Format profile : [email protected] /* <![CDATA[ */!function(){try{var t="currentScript"in document?document.currentScript:function(){for(var t=document.getElementsByTagName("script"),e=t.length;e--;)if(t[e].getAttribute("cf-hash"))return t[e]}();if(t&&t.previousSibling){var e,r,n,i,c=t.previousSibling,a=c.getAttribute("data-cfemail");if(a){for(e="",r=parseInt(a.substr(0,2),16),n=2;a.length-n;n+=2)i=parseInt(a.substr(n,2),16)^r,e+=String.fromCharCode(i);e=document.createTextNode(e),c.parentNode.replaceChild(e,c)}}}catch(u){}}();/* ]]> */ Nominal bit rate : 687 Kbps Width : 640 pixels Height : 272 pixels Display aspect ratio : 2.35:1 Frame rate mode : Constant Frame rate : 23.976 fps Bit depth : 8 bits Scan type : Progressive Audio Format : AAC Format profile : LC Channel(s) : 2 channels Channel positions : Front: L R Sampling rate : 48.0 KHz Language : Hindi Subtitle Language : English & Arabic Chapters : YES SAMPLE Included A DDR Exclusive Release 1 posted by mard22 (2015-04-14 23:37:41) thankssssss 2 posted by harpreets088 (2015-04-15 04:09:57) Thanks 3 posted by j0k3r777 (2015-04-15 05:23:49) thank you dude 4 posted by DEVENDRA_SINGH140992 (2015-04-15 06:25:21) PLS UPLOAD Badlapur 2015 720p BRRIP magic mike xxl music community Flying Swords of Dragon Gate X-Art - Daphne, Katia ado.net sex Taylor Swift - I Knew You Were Trouble arremessa teu bumbum na vara dvd dava foxx interstellar urbin4hd ETTV adult ja rule mary j blige iyisin tabi the blacllist porn peter+andre+big+night Shyla.Stylez, Gianna Michaels you mad or nah x men first class 2011 zuma \u03bf\u03c4\u03b1\u03bd \u03b1\u03b3\u03b3\u03b5\u03bb\u03b7 \u03ba\u03bb\u03b5\u03bd\u03b5 \u03c0\u03b1\u03c0\u03b1\u03c1\u03b9\u03b6\u03bf\u03c5 mistresst 
truffle butter Nicki Minaj Joanna Angel, Small Hands, Scarlet LaVey ETRG 7 data recovery Hawaii.Five-0 doom 2005 xvid Rang Rasiya picasa39 el-nino-2014 vikings S02E06 souchon voulzy derriere les mots anime straponcum mukti bengali movie movie might med s02e13 The Night That Panic America riley jenner the flash scarface ft lil wayne & bun b forgot about me down teugu 50 cent \u0627\u0646\u0633 \u0643\u0631\u064a\u0645 \u0643\u0644 \u0627\u0644\u0639\u0645\u0631 \u0639\u062d\u0633\u0627\u0628\u0643 \u062a\u062d\u0645\u064a\u0644 YIFY big sean idfwu farrah flower software son of satiyamurti telugu full movie DDR jovanotti si alza il vento omar ruiz el americano 2015 hindi movies video mummyefaa blacked Jillian Janson Yellow Claw/Ayden - Till It Hurts Resident evil animated Monitoring',
     'name': u'Badlapur *2015* 1CD - x264 - DvDrip - AAC - MSubs [DDR] torrent',
     'size': u'708.25 MB',
     'url': 'http://extratorrent.cc/torrent/4140564/Badlapur+*2015*+1CD+-+x264+-+DvDrip+-+AAC+-+MSubs+%5BDDR%5D.html'}
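
The size and description values above owe their shape to the string handling in parse_torrent. Pulled out into plain functions, that handling might look like the sketch below. The function names clean_text and clean_size and the input lists are hypothetical, made up for illustration; the lists stand in for what .extract() returns.

```python
def clean_text(parts, default='N/A'):
    """Join extracted text nodes and collapse every run of whitespace."""
    if not parts:
        return default
    # str.split() with no arguments splits on any run of whitespace,
    # so re-joining the pieces with one space normalizes the text.
    return ' '.join(' '.join(parts).split())


def clean_size(parts, default='N/A'):
    """Take the first text node and turn non-breaking spaces into spaces."""
    if not parts:
        return default
    return parts[0].strip().replace(u'\xa0', ' ')


print(clean_text(['  1 CD DVDRip\n', '\t- x264 AAC  ']))
print(clean_size([u' 708.25\xa0MB ']))
print(clean_text([]))
```

Because //text() returns every descendant text node, the description in the sample output also contains inline JavaScript and user comments. If that is unwanted, a narrower expression such as //text()[not(ancestor::script)] may help, though I have not tried it on this site.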