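I have slowly put together the following (working) Scrapy spider, which retrieves news articles and some other data from a news website. The problem I am running into is that one of the item fields contains a lot of whitespace. From the Scrapy documentation and Stack Overflow (How To Remove White Space in Scrapy Spider Data) I gathered that I should use an Item Loader. I don't know how to integrate an Item Loader into my existing code, which is derived from the standard scraper in the Scrapy tutorial. I find it hard to combine the Item Loader code with what the tutorial explains.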
import scrapy
from datetime import timedelta, date
from nos.items import NosItem


def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)


start_date = date(2015, 8, 19)
end_date = date(2015, 8, 20)
nos_urls = []
for single_date in daterange(start_date, end_date):
    nos_urls.append(single_date.strftime("http://nos.nl/nieuws/archief/%Y-%m-%d"))


class NosSpider(scrapy.Spider):
    name = "nos"
    allowed_domains = ["nos.nl"]
    start_urls = nos_urls

    def parse(self, response):
        for sel in response.xpath('//*[@id="archief"]/ul/li'):
            item = NosItem()
            item['name'] = sel.xpath('a/@href').extract()[0]
            item['date'] = sel.xpath('a/div[1]/time/@datetime').extract()[0]
            item['desc'] = sel.xpath('a/div[@class="list-time__title link-hover"]/text()').extract()[0]
            url = response.urljoin(item['name'])
            request = scrapy.Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request

    def parse_dir_contents(self, response):
        for sel in response.xpath('//*[@id="content"]/article'):
            item = response.meta['item']
            textdata = sel.xpath('section//text()').extract()
            textdata = " ".join(textdata)
            #textdata = textdata.replace("\n", "")
            #textdata = textdata.strip(' \t\n\r\\n')
            item['article'] = textdata
            yield item
Here is a sample of the JSON export I currently get:
{"date": "2015-08-19T15:43:26+0200", "article": "\n Man met bijl aangehouden \n \n \n De man zou zijn vrouw hebben aangevallen met een bijl en dreigde zichzelf iets aan te doen.\n Video afspelen \n 00:34\n De politie heeft in Schijndel een man aangehouden die verdacht wordt van huiselijk geweld. De man had zichzelf in een woning opgesloten en dreigde zichzelf iets aan te doen. [text cut off]", "name": "/artikel/2052794-politie-in-schijndel-heeft-handen-vol-aan-verdachte-huiselijk-geweld.html", "desc": "Politie in Schijndel heeft handen vol aan verdachte huiselijk geweld"}
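The article field contains whitespace, along with a lot of other content that I would like to remove.
I believe these processors would help solve the problem: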
l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
l.default_output_processor = Join()
Answer 0 (score: 3)
You can apply unicode.strip() to the extracted strings:
textdata = " ".join(map(unicode.strip,textdata))
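This strips leading and trailing whitespace from each extracted text fragment before joining, which makes the article content cleaner. Note that unicode.strip only exists on Python 2; on Python 3 the equivalent is str.strip.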
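Stripping each fragment still leaves empty strings in the list, so the joined text can contain runs of spaces where whitespace-only fragments used to be. A minimal sketch of a stricter cleanup, in plain Python and independent of Scrapy, might look like the following. The helper name clean_fragments and the sample list are hypothetical; the sample is only shaped like what sel.xpath('section//text()').extract() returns, with text shortened from the JSON export in the question.

```python
def clean_fragments(fragments):
    """Join text fragments into one string separated by single spaces."""
    words = []
    for fragment in fragments:
        # split() with no arguments splits on any run of whitespace
        # (spaces, tabs, newlines) and never yields empty strings,
        # so whitespace-only fragments contribute nothing.
        words.extend(fragment.split())
    return " ".join(words)


# Hypothetical sample, shaped like the list returned by .extract()
sample = [
    "\n        Man met bijl aangehouden  \n  ",
    "\n   ",
    "De man zou zijn vrouw hebben aangevallen met een bijl.\n",
]
cleaned = clean_fragments(sample)
print(cleaned)
```

In the spider, this would presumably replace the " ".join(...) line in parse_dir_contents, for example item['article'] = clean_fragments(textdata). It only normalizes whitespace; unwanted text such as the video player labels would still need a narrower XPath.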