Question

I wrote a simple script to extract data from a website. The script works as expected, but I'm not happy with the output format.
Here is my code:

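import urlparse  # Python 2 module; on Python 3 this lives in urllib.parse

from scrapy import Request, Spider
from scrapy.loader import ItemLoader
# Article is the project's own Item subclass; its import is not shown here.


class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",  # trailing comma makes this a tuple
    )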

    def parse(self, response):
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))

        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                      callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # extract the title of each article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()

I'm not happy with the output, for example:

[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name> {'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}

I think I need to use a custom ItemLoader class, but I don't know how to do that. I need your help.

TL;DR: I need to convert the text scraped by Scrapy from unicode to utf-8.

Answer 1

As you can see below, this isn't much of a Scrapy issue but rather one of Python itself. It could only marginally be called an issue at all :)

    $ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya

    In [7]: print response.xpath('//h1/text()').extract_first()
    "ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"

    In [8]: response.xpath('//h1/text()').extract_first()
    Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'

What you see are two different representations of the same thing: a unicode string.

I would suggest running crawls with -L INFO, or adding LOG_LEVEL='INFO' to your settings.py, so that this output is not shown in the console.
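
The "two representations" point can be reproduced without Scrapy. The following is a minimal sketch in Python 3 syntax, whereas the shell session in this answer is Python 2. In Python 2 the console echo of a unicode string is escaped by default; in Python 3 the built-in `ascii()` produces that same escaped form. The variable names are illustrative only, and the title is shortened.

```python
import ast

# One text string; the escapes below are just a way of typing it in source code.
title = '\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e"'

# Escaped, ASCII-only display form (what the DEBUG log line shows).
escaped = ascii(title)
print(escaped)
# print(title) would show the readable Cyrillic text on a UTF-8 terminal.

# The escaped form is only a display: evaluating it gives back the same string.
assert ast.literal_eval(escaped) == title

# Encoding to UTF-8 is a separate step that turns text into bytes.
utf8_bytes = title.encode('utf-8')
assert utf8_bytes.decode('utf-8') == title
```

Nothing needs to be "converted" inside the item: the escaping happens only when the string is displayed, and encoding to UTF-8 happens only when the text is written out as bytes.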

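One annoying thing is that when you save as JSON, you get escaped unicode JSON. For example,

    $ scrapy crawl example -L INFO -o a.jl

gives you:

    $ cat a.jl
    {"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}

This is correct, but it takes more space, and most applications handle non-escaped JSON equally well.

Adding a few lines to your settings.py can change this behaviour:

    from scrapy.exporters import JsonLinesItemExporter

    class MyJsonLinesItemExporter(JsonLinesItemExporter):
        def __init__(self, file, **kwargs):
            super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

    FEED_EXPORTERS = {
        'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
        'jl': 'myproject.settings.MyJsonLinesItemExporter',
    }

Essentially, all we do is set ensure_ascii=False for the default JSON item exporters. This prevents escaping. I wish there were an easier way to pass arguments to exporters, but I can't see one, since they are initialized with their default arguments where Scrapy instantiates them. Anyway, now your JSON file has:

    $ cat a.jl
    {"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}

which is better looking, equally valid and more compact.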
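
The effect of ensure_ascii can be checked with the standard json module alone. This is a minimal sketch in Python 3 syntax; the `item` dict is a stand-in for a scraped item with a shortened title, not something Scrapy produces in exactly this shape.

```python
import json

# Stand-in for a scraped item (shortened title).
item = {"title": '\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e"'}

escaped_line = json.dumps(item)                      # default: ensure_ascii=True
compact_line = json.dumps(item, ensure_ascii=False)  # non-ASCII characters kept as-is

# Both lines are valid JSON and decode to the same item ...
assert json.loads(escaped_line) == json.loads(compact_line) == item

# ... but the unescaped line is shorter, even when measured in UTF-8 bytes.
escaped_size = len(escaped_line.encode("utf-8"))
compact_size = len(compact_line.encode("utf-8"))
print(escaped_size, compact_size)
```

Newer Scrapy releases also have a FEED_EXPORT_ENCODING setting. As far as I know, setting it to 'utf-8' in settings.py gives unescaped JSON feeds without a custom exporter class, but check the documentation for the Scrapy version you run.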

Answer 2

There are two independent issues that affect how unicode strings are displayed.

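If you return a list of strings, the output file will have some issues with them, because by default it uses the ascii codec to serialize list elements. You can work around this as shown below, but using extract_first(), as suggested by @neverlastn, is more appropriate.
```
from scrapy import Field, Item

class Article(Item):
    title = Field(serializer=lambda x: u', '.join(x))
```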
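
To see what that serializer does, and why extract_first() sidesteps the problem, here is a small sketch in plain Python 3 with no Scrapy involved. `extracted` stands in for the list that `.extract()` returns, and `join_values` is a hypothetical named version of the lambda above.

```python
# Stand-in for response.xpath(...).extract(): always a list of strings.
extracted = ['\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e"']


def join_values(values):
    """Same logic as the Field serializer: flatten a list into one string."""
    return ', '.join(values)


joined = join_values(extracted)

# Roughly what extract_first() amounts to: the first match, or None if empty.
first = extracted[0] if extracted else None

# With a single match, both approaches yield the same plain string,
# so the item no longer carries a list that has to be serialized.
assert joined == first
assert isinstance(joined, str)
```

With several matches the two differ: the serializer joins all of them with ", ", while extract_first() keeps only the first.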

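The default implementation of the __repr__() method serializes unicode strings to their escaped form, \uxxxx. You can change this behaviour by overriding the method in your item class: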

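from scrapy import Item

class Article(Item):
    def __repr__(self):
        # Python 2 only: the `unicode` type does not exist on Python 3.
        data = dict(self)
        for k in data:
            if isinstance(data[k], unicode):
                data[k] = data[k].encode('utf-8')
        # repr() of a dict would escape the bytes again, so format the pairs by hand.
        return '{%s}' % ', '.join('%r: %s' % (k, v) for k, v in data.items())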

Scrapy: convert from unicode to utf-8
