Question

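My spider currently fetches the results I need, but they come back encoded as unicode (UTF-8, I believe). When I save these results to CSV, I have a lot of cleanup to do because of all the [u & other characters that Scrapy inserts.

How can I store the results as Latin characters & not unicode? Where exactly do I need to make the change?

Thanks. -TM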

Answer 1

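item_extracted is of type unicode. You can encode it to Latin-1 in the parse function, in an item pipeline, or in an output processor.

The simplest way is to add this line to your parse function: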

item_to_be_stored = item_extracted.encode('latin-1','ignore')
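To make the effect of the 'ignore' argument concrete, here is a minimal sketch. It is written for Python 3, where text is `str` and the encoded result is `bytes` (the answer itself targets Python 2, where the result is a plain `str`). The helper name `to_latin1` is hypothetical and not part of Scrapy.

```python
def to_latin1(text, errors="ignore"):
    """Encode text to Latin-1 bytes.

    Hypothetical helper for illustration. Characters that Latin-1 cannot
    represent are handled according to `errors`: dropped with "ignore",
    replaced by "?" with "replace", or rejected with "strict".
    """
    return text.encode("latin-1", errors)


# "é" exists in Latin-1, so it survives as a single byte.
print(to_latin1("café"))
# The euro sign is not in Latin-1, so "ignore" silently drops it.
print(to_latin1("price: 5€"))
# "replace" keeps a visible marker instead of dropping the character.
print(to_latin1("price: 5€", "replace"))
```

Note that 'ignore' loses data without warning, so 'replace' may be the safer choice if you need to spot what was removed.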

Alternatively, you can define a function and use it in your item class:

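from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Encode a unicode value to a Latin-1 byte string, dropping characters
    # that Latin-1 cannot represent. The result must be returned.
    return unicode_to_str(text, 'latin-1', 'ignore')

class YourItem(Item):
    # Pass the function itself, not the result of calling it.
    # Processors declared here only apply when the item is filled through
    # an ItemLoader, which hands the output processor the list of collected values.
    name = Field(output_processor=u_to_str)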
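The item pipeline option mentioned in this answer is not shown above. A minimal sketch of what it could look like follows, again in Python 3. The class name `Latin1Pipeline` is hypothetical; only the `process_item(self, item, spider)` method shape is what Scrapy expects of a pipeline.

```python
class Latin1Pipeline(object):
    """Hypothetical pipeline that encodes every text field to Latin-1."""

    def process_item(self, item, spider):
        # Copy the pairs into a list so the item can be modified while looping.
        for key, value in list(item.items()):
            if isinstance(value, str):
                # Characters outside Latin-1 are dropped, as in the answer.
                item[key] = value.encode("latin-1", "ignore")
        # A pipeline must return the item so later stages receive it.
        return item
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting, and it keeps the encoding step in one place instead of repeating it in every parse function.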

Answer 2

If your problem is as you describe it, the solution is as simple as casting to a string.

>>> a = u'spam and eggs'
>>> a
u'spam and eggs'
>>> type(a)
<type 'unicode'>
>>> b = str(a)
>>> b
'spam and eggs'
>>> type(b)
<type 'str'>

Edit: since an exception can occur, it is best to wrap the conversion in a try/except:

try:
    str(a)
except UnicodeError:
    print "Skipping string %s" % a

Scrapy Python spider: storing results in Latin-1 rather than unicode
