scrapy

Date: 2016-07-09 03:54:49

Tags: python scrapy web-crawler scrapy-spider

I want to crawl a news site, extract the article links, and fetch the full articles. The problem is that the links are relative.

The news site is http://www.puntal.com.ar/v2/

and the links look like this:

<div class="article-title">
    <a href="/v2/article.php?id=187222">Barros Schelotto: "No somos River y vamos a tratar de pasar a la final"</a>
</div>

So the relative link is "/v2/article.php?id=187222".

My spider is the following (edited):

# -*- coding: utf-8 -*-

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from urlparse import urljoin
from scrapy.http.request import Request
try:
    from urllib.parse import urljoin  # Python 3.x
except ImportError:
    from urlparse import urljoin  # Python 2.7

from puntalcomar.items import PuntalcomarItem


class PuntalComArSpider(CrawlSpider):
    name = 'puntal.com.ar'
    allowed_domains = ['http://www.puntal.com.ar/v2/']
    start_urls = ['http://www.puntal.com.ar/v2/']

    rules = (
        Rule(LinkExtractor(allow=(''),), callback="parse", follow=True),
    )

    def parse_url(self, response):
        hxs = Selector(response)
        urls = hxs.xpath('//div[@class="article-title"]/a/@href').extract()
        print 'enlace relativo ', urls
        for url in urls:
            urlfull = urljoin('http://www.puntal.com.ar', url)
            print 'enlace completo ', urlfull
            yield Request(urlfull, callback=self.parse_item)

    def parse_item(self, response):
        hxs = Selector(response)
        dates = hxs.xpath('//span[@class="date"]')
        title = hxs.xpath('//div[@class="title"]')
        subheader = hxs.xpath('//div[@class="subheader"]')
        body = hxs.xpath('//div[@class="body"]/p')
        items = []
        for date in dates:
            item = PuntalcomarItem()
            item["date"] = date.xpath('text()').extract()
            item["title"] = title.xpath("text()").extract()
            item["subheader"] = subheader.xpath('text()').extract()
            item["body"] = body.xpath("text()").extract()
            items.append(item)
        return items

But it does not work.

I am using Linux Mint and Python 2.7.6.

Shell output:

$ scrapy crawl puntal.com.ar
2016-07-10 13:39:15 [scrapy] INFO: Scrapy 1.1.0 started (bot: puntalcomar)
2016-07-10 13:39:15 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'puntalcomar.spiders', 'SPIDER_MODULES': ['puntalcomar.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'puntalcomar'}
2016-07-10 13:39:15 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-10 13:39:15 [scrapy] INFO: Enabled item pipelines:
['puntalcomar.pipelines.XmlExportPipeline']
2016-07-10 13:39:15 [scrapy] INFO: Spider opened
2016-07-10 13:39:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (404) <GET http://www.puntal.com.ar/robots.txt> (referer: None)
2016-07-10 13:39:15 [scrapy] DEBUG: Redirecting (301) to <GET http://www.puntal.com.ar/v2/> from <GET http://www.puntal.com.ar/v2>
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (200) <GET http://www.puntal.com.ar/v2/> (referer: None)
enlace relativo  [u'/v2/article.php?id=187334', u'/v2/article.php?id=187324', u'/v2/article.php?id=187321', u'/v2/article.php?id=187316', u'/v2/article.php?id=187335', u'/v2/article.php?id=187308', u'/v2/article.php?id=187314', u'/v2/article.php?id=187315', u'/v2/article.php?id=187317', u'/v2/article.php?id=187319', u'/v2/article.php?id=187310', u'/v2/article.php?id=187298', u'/v2/article.php?id=187300', u'/v2/article.php?id=187299', u'/v2/article.php?id=187306', u'/v2/article.php?id=187305']
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187334
2016-07-10 13:39:15 [scrapy] DEBUG: Filtered offsite request to 'www.puntal.com.ar': <GET http://www.puntal.com.ar/v2/article.php?id=187334>
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187324
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187321
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187316
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187335
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187308
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187314
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187315
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187317
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187319
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187310
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187298
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187300
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187299
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187306
enlace completo  http://www.puntal.com.ar/v2/article.php?id=187305
2016-07-10 13:39:15 [scrapy] INFO: Closing spider (finished)
2016-07-10 13:39:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 50497,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 726952),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 16,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 121104)}
2016-07-10 13:39:15 [scrapy] INFO: Spider closed (finished)

I tried the absolute links and they are correct. I do not really understand what is happening.

1 Answer:

Answer 0 (score: 0)

That i[1:] is odd, and it is the key problem here. The slicing is not needed:

# Needs "import urlparse" (Python 2) and "from scrapy.http import Request"
# at the top of the module.
def parse(self, response):
    urls = response.xpath('//div[@class="article-title"]/a/@href').extract()
    for url in urls:
        # Resolve each relative href against the URL of the current response.
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_url)

Note that I also fixed the XPath expression: it needs to start with // so that the div elements are found at any level of the DOM tree.