The problem I'm having is that my CrawlSpider isn't crawling the entire site. I'm trying to crawl a news site; it collects about 5,900 items and then quits with reason "finished", but there are large date gaps in the scraped items. I'm not using any custom middleware or settings. Thanks for your help!
My spider (please excuse the messy list-handling code at the bottom), followed by the last few lines of the log file:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from news.items import NewsItem
import re

class CrawlSpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal//']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        p = re.compile(r"(%\d.+)|(var LEO).*|(createInline).*|(<.*?>|\r+|\n+|\s{2,}|\t|[\'])|(\xa0+|\xe2+|\x80+|\\x9.+)")
        hxs = HtmlXPathSelector(response)
        i = NewsItem()
        i['headline'] = hxs.select('//p[@class = "detailedArticleTitle"]/text()').extract()[0].strip().encode("utf-8")
        i['date'] = hxs.select('//div[@id = "DateTime"]/text()').re('\d+/\d+/[12][09]\d\d')[0].encode("utf-8")
        text = [graf.strip().encode("utf-8") for graf in hxs.select('//div[@id = "article"]//div[@style = "LINE-HEIGHT: 100%"]|//div[@id = "article"]//p//text()').extract()]
        text2 = ' '.join(text)
        text3 = re.sub("'", ' ', p.sub(' ', text2))
        i['text'] = re.sub('"', ' ', text3)
        return i
Log output:
2012-04-19 11:13:57-0700 [crawl] INFO: Closing spider (finished)
2012-04-19 11:13:57-0700 [crawl] INFO: Stored csv feed (5949 items) in: news.csv
2012-04-19 11:13:57-0700 [crawl] INFO: Dumping spider stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/twisted.internet.error.ConnectionLost': 2,
'downloader/request_bytes': 5778930,
'downloader/request_count': 12380,
'downloader/request_method_count/GET': 12380,
'downloader/response_bytes': 635795595,
'downloader/response_count': 12378,
'downloader/response_status_count/200': 6081,
'downloader/response_status_count/302': 6062,
'downloader/response_status_count/400': 234,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 4, 19, 18, 13, 57, 343594),
'item_scraped_count': 5949,
'request_depth_max': 23,
'scheduler/disk_enqueued': 12380,
'spider_exceptions/IndexError': 131,
'start_time': datetime.datetime(2012, 4, 19, 17, 16, 40, 75935)}
2012-04-19 11:13:57-0700 [crawl] INFO: Spider closed (finished)
2012-04-19 11:13:57-0700 [scrapy] INFO: Dumping global stats:
{}
Answer 0 (score: 1)
The parse_item() method should return the loaded item; see the Scrapy docs. Like this:
from scrapy.contrib.loader import XPathItemLoader  # loader provides add_xpath()/load_item()

class MySpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        p = re.compile(r"(%\d.+)|(var LEO).*|(createInline).*|(<.*?>|\r+|\n+|\s{2,}|\t|[\'])|(\xa0+|\xe2+|\x80+|\\x9.+)")
        hxs = HtmlXPathSelector(response)
        # populate the item through an item loader instead of indexing extract() directly
        l = XPathItemLoader(item=NewsItem(), selector=hxs)
        l.add_xpath('headline', '//p[@class = "detailedArticleTitle"]/text()')
        l.add_xpath('date', '//div[@id = "DateTime"]/text()',
                    re='\d+/\d+/[12][09]\d\d')
        # Do something...
        return l.load_item()
Post-processing (e.g. strip() and encode("utf-8")) can be done in an item pipeline, as sketched below.
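A minimal sketch of such a pipeline, assuming the field names used by the spider above; the CleanNewsPipeline name and the news.pipelines module path are hypothetical:

# pipelines.py -- hypothetical post-processing pipeline (sketch, not from the question)
class CleanNewsPipeline(object):
    def process_item(self, item, spider):
        for field in ('headline', 'date', 'text'):
            if field in item:
                value = item[field]
                # item loaders return lists by default; join them before cleaning
                if isinstance(value, list):
                    value = ' '.join(value)
                item[field] = value.strip().encode("utf-8")
        return item

# enable it in settings.py (Scrapy 0.14-era list syntax):
# ITEM_PIPELINES = ['news.pipelines.CleanNewsPipeline']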
Update: there are several inaccuracies in your code:
- The spider class is named the same as its base class (CrawlSpider); give it a different name (e.g., MySpider).
- The NewsItem object should be populated through an item loader (the XPathItemLoader in the snippet above); a plain Item instance does not accept a selector or provide add_xpath()/load_item().
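For completeness, a guess at what news/items.py could look like for those fields (the question does not show the real definition, so this is only an assumption); the processors make the loader emit single cleaned strings instead of lists:

# items.py -- assumed NewsItem definition (the real one is not shown in the question)
from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose, TakeFirst, Join

class NewsItem(Item):
    headline = Field(input_processor=MapCompose(unicode.strip),
                     output_processor=TakeFirst())
    date = Field(output_processor=TakeFirst())
    text = Field(input_processor=MapCompose(unicode.strip),
                 output_processor=Join())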