Scrapy is not crawling the entire website

Time: 2012-04-19 20:13:34

Tags: python screen-scraping scrapy

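The problem I'm having is that my CrawlSpider isn't crawling the entire site. I'm trying to scrape a news site; it collects about 5,900 items and then quits with the reason "finished", but there are large date gaps in the scraped items. I'm not using any custom middleware or settings. Thanks for your help!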

My spider (please excuse the messy list code at the bottom), followed by the last few lines of the log file:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from news.items import NewsItem
import re

class CrawlSpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal//']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        p = re.compile(r"(%\d.+)|(var LEO).*|(createInline).*|(<.*?>|\r+|\n+|\s{2,}|\t|[\'])|(\xa0+|\xe2+|\x80+|\\x9.+)")
        hxs = HtmlXPathSelector(response)
        i = NewsItem()
        i['headline'] = hxs.select('//p[@class = "detailedArticleTitle"]/text()').extract()[0].strip().encode("utf-8")
        i['date'] = hxs.select('//div[@id = "DateTime"]/text()').re('\d+/\d+/[12][09]\d\d')[0].encode("utf-8")
        text = [graf.strip().encode("utf-8") for graf in hxs.select('//div[@id = "article"]//div[@style = "LINE-HEIGHT: 100%"]|//div[@id = "article"]//p//text()').extract()]
        text2 = ' '.join(text)
        text3 = re.sub("'", ' ', p.sub(' ', text2))
        i['text'] = re.sub('"', ' ', text3)
        return i

Log output:

2012-04-19 11:13:57-0700 [crawl] INFO: Closing spider (finished)
2012-04-19 11:13:57-0700 [crawl] INFO: Stored csv feed (5949 items) in: news.csv
2012-04-19 11:13:57-0700 [crawl] INFO: Dumping spider stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.ConnectionLost': 2,
 'downloader/request_bytes': 5778930,
 'downloader/request_count': 12380,
 'downloader/request_method_count/GET': 12380,
 'downloader/response_bytes': 635795595,
 'downloader/response_count': 12378,
 'downloader/response_status_count/200': 6081,
 'downloader/response_status_count/302': 6062,
 'downloader/response_status_count/400': 234,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 4, 19, 18, 13, 57, 343594),
 'item_scraped_count': 5949,
 'request_depth_max': 23,
 'scheduler/disk_enqueued': 12380,
 'spider_exceptions/IndexError': 131,
 'start_time': datetime.datetime(2012, 4, 19, 17, 16, 40, 75935)}
2012-04-19 11:13:57-0700 [crawl] INFO: Spider closed (finished)
2012-04-19 11:13:57-0700 [scrapy] INFO: Dumping global stats:
{}

1 answer:

Answer 0 (score: 1)

The parse_item() method should return a loaded item. See the Scrapy docs. Something like this:

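from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from news.items import NewsItem
import re

class MySpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal/']
    # The trailing comma makes rules a one-element tuple rather than a bare Rule.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        p = re.compile(r"(%\d.+)|(var LEO).*|(createInline).*|(<.*?>|\r+|\n+|\s{2,}|\t|[\'])|(\xa0+|\xe2+|\x80+|\\x9.+)")
        hxs = HtmlXPathSelector(response)
        # add_xpath() and load_item() are item-loader methods, so this assumes
        # NewsItem is defined as a loader (not a plain Item) that accepts a selector.
        i = NewsItem(selector=hxs)
        i.add_xpath('headline', '//p[@class = "detailedArticleTitle"]/text()')
        i.add_xpath('date', '//div[@id = "DateTime"]/text()',
                    re=r'\d+/\d+/[12][09]\d\d')
        # Do something...
        return i.load_item()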

Post-processing (for example, strip() and encode("utf-8")) can be done in a pipeline.
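
As a rough illustration of moving that cleanup into a pipeline, here is a minimal sketch. The class name CleanTextPipeline and its fields tuple are hypothetical; only the process_item(item, spider) method shape follows Scrapy's pipeline convention. The sketch only strips whitespace; an encoding step could be added at the same place if needed.

```python
class CleanTextPipeline(object):
    """Hypothetical pipeline that strips whitespace from selected string fields."""

    # Field names taken from the question's NewsItem; adjust as needed.
    fields = ('headline', 'date', 'text')

    def process_item(self, item, spider):
        for field in self.fields:
            value = item.get(field)
            # Skip fields that are missing or not plain strings.
            if isinstance(value, str):
                item[field] = value.strip()
        return item


# Quick check with a plain dict standing in for a scraped item.
cleaned = CleanTextPipeline().process_item(
    {'headline': '  Some title \n', 'date': '04/19/2012'}, spider=None)
print(cleaned)
```

To take effect in a real project, a pipeline like this would have to be listed in the ITEM_PIPELINES setting; the module path depends on the project layout.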

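Update: there are several inaccuracies in your code:

  • Your custom spider class name must differ from the class it inherits from (CrawlSpider); rename it (for example, MySpider).
  • start_urls is incorrect: 'http://www.domain.com/portal//' ends with two slashes.
  • Good style is to pass the selector as an argument when constructing the NewsItem object (i = NewsItem(selector=hxs)).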
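
For the doubled slash in the start URL, one way to catch such typos before they reach start_urls is a small check like the sketch below. The helper name collapse_slashes is hypothetical and uses only the Python 3 standard library; it is not part of Scrapy.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url):
    """Collapse runs of '/' in the path, leaving the '//' after the scheme intact."""
    parts = urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(collapse_slashes('http://www.domain.com/portal//'))
```

Whether a server treats the doubled and single-slash forms as the same page varies, so this is only a convenience for spotting the typo, not a guaranteed fix for the missing pages.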