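My spider skips some of the items I parse and throws the error below. All of the items are on a single page. There are 20 items, and usually 3 or 4 of them get skipped. Any suggestions, please?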
  File "/home/ec2-user/project/project/pipelines.py", line 19, in process_item
    'title': str(item['title']),
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 25: ordinal not in range(128)
Spider:
def parse(self, response):
    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        url = item.xpath(".//h2/a/@href").extract()[0]
        stamp = item.xpath(".//time/@datetime").extract_first()
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yield article
Pipeline:
class DynamoDBStorePipeline(object):
    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")
        table = dynamodb.Table('db1')
        table.put_item(
            Item={
                'url': str(item['url']),
                'title': str(item['title']),
                'stamp': str(item['stamp']),
            }
        )
        return item
Answer 0 (score: 1)
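I changed 'title': str(item['title']) to 'title': item['title'].encode('utf-8'), and now everything works. In Python 2, calling str() on a unicode value encodes it with the ASCII codec, which fails on characters such as the curly quote u'\u201c'; that is why only the items whose titles contain such characters were skipped.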
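To illustrate why the change helps, here is a minimal sketch that reproduces the failure and the fix outside of Scrapy. The sample title and the helper names encode_ascii and encode_utf8 are hypothetical and not part of the original project; encode_ascii only stands in for what str() does implicitly to a unicode value in Python 2.

```python
# -*- coding: utf-8 -*-
# Sketch only: the title and both helpers are illustrative assumptions.

# Hypothetical scraped title containing curly quotes (U+201C / U+201D).
title = u"He said \u201chello\u201d"


def encode_ascii(text):
    """Mimic Python 2's str(unicode_value): an implicit ASCII encode."""
    return text.encode('ascii')


def encode_utf8(text):
    """The approach from the answer: encode explicitly as UTF-8."""
    return text.encode('utf-8')


# The ASCII codec cannot represent the curly quotes, so this raises
# the same UnicodeEncodeError that appeared in the pipeline.
try:
    encode_ascii(title)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every Unicode character, so this succeeds
# and the original text can be recovered by decoding.
encoded = encode_utf8(title)
```

Under Python 3, str is already a Unicode type, so str(item['title']) would not raise this error and the explicit encode step is likely unnecessary there.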