Question

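My spider skips some of the items I parse and throws the error below. All the items are on one page, and there are 20 of them. Usually 3 or 4 get skipped. Any suggestions, please?

File "/home/ec2-user/project/project/pipelines.py", line 19, in process_item
    'title': str(item['title']),
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 25: ordinal not in range(128)

Spider: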

def parse(self, response):
    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        url = item.xpath(".//h2/a/@href").extract()[0]
        stamp = item.xpath(".//time/@datetime").extract_first()
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yield article

Pipeline:

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")

        table = dynamodb.Table('db1')

        table.put_item(
            Item={
                'url': str(item['url']),
                'title': str(item['title']),
                'stamp': str(item['stamp']),
            }
        )
        return item

Answer 1

I changed 'title': str(item['title']) to 'title': item['title'].encode('utf-8'), and now everything works.
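
A likely explanation, assuming the spider runs under Python 2 (the u'\u201c' in the traceback suggests this): calling str() on a unicode object encodes it with the default ASCII codec, which cannot represent the curly quote U+201C, so only titles containing such characters fail. The sketch below reproduces that encoding step explicitly. The sample title is hypothetical and not taken from the scraped page.

```python
# -*- coding: utf-8 -*-
# Hypothetical title containing curly quotes, like the character in the traceback.
title = u"Startup says \u201cwe are hiring\u201d"

# Under Python 2, str(title) implicitly encodes with the ASCII codec.
# The explicit call below performs the same step on Python 2 and Python 3.
try:
    title.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# UTF-8 can represent every Unicode character, so this encoding succeeds.
encoded_title = title.encode("utf-8")

print(repr(ascii_error))
print(repr(encoded_title))
```

Titles made up only of ASCII characters pass through str() without trouble, which would explain why just 3 or 4 of the 20 items were skipped.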
