Question

I am trying to get a Scrapy spider to crawl a website, but one of the elements I need for my item is labelled in Spanish and contains an accented vowel (í).

titulo = title.select(u'.//["Título Original:"]/text()').extract()

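I found a similar question here, but the accepted answer did not work for me.

Adding u at the start of the string solves part of the problem, but it then raises this error: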

UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 21: ordinal not in range(128)

I found other questions here that suggest using '.../text()'.decode('utf-8'), but doing that, or using .encode('utf-8') instead, gives me this error:

    exceptions.ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

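Is there something I am missing, or some other way to do this? Or would I be better off writing a regular expression that matches every part of my string except that letter?

Here is my code so far: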

def parse(self, response):
    # change the response to an HtmlResponse to allow for utf-8 encoding of the body.
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)

    print '\n\nresponse encoding', response.encoding  # the page is encoded in utf-8

    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//div[@class="datosespectaculo"]')

    items = []
    for title in titles:
        item = CarteleraItem()
        titulo = title.select(u'.//["Título Original:"]/text()'.encode('utf-8')).extract()
        Ano = title.select('.//span[@itemprop="copyrightYear"]/text').extract()
        item["title"] = titulo
        item["Ano"] = Ano
        items.append(item)

Here is the source of the page in question:

<div id="contgeneral">
<div class="contyrasca">
<div id="contfix">
<div class="contespectaculo">

<div class="colizq"><div itemscope itemtype="http://schema.org/Movie">
<h1 class="titulo" itemprop="name">15.361</h1>

<img class="afiche" src="http://www.cartelera.com.uy/imagenes_espectaculos/musicdetail13/14770.jpg"/>
<div class="datosespectaculo">

<strong>Título Original:</strong> <em>15.361</em><br />

<strong>Año: </strong><span itemprop="copyrightYear">2014</span><br />
<strong>Género: </strong><span itemprop="genre">Comedia/Drama</span><br />
<strong>Duración: </strong><span itemprop="duration">60&#39;</span><br />
<strong>Calificación: </strong>+18 años<br />

Answer 1

If # -*- coding: utf-8 -*- does not work, you can use a unicode string in which the non-ASCII characters are written as \u escape sequences.

So the XPath selector becomes:

titulo=title.select(u'.//["T\u00edtulo Original:"]/text()'.encode('utf-8')).extract()

I usually use a simple Python shell session to check the escape sequences:

paul@wheezy:~$ python
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'.//["Título Original:"]/text()'
u'.//["T\xedtulo Original:"]/text()'
>>> u'.//["T\u00edtulo Original:"]/text()'
u'.//["T\xedtulo Original:"]/text()'
>>>
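
To make the shell session above a little more concrete, here is a minimal sketch of how the escaped and literal spellings relate, and of what each codec does with the í. The variable names are only illustrative. The second half reproduces the kind of failure reported in the question: the ASCII codec has no byte for í, whereas UTF-8 turns it into two bytes. As far as I can tell, passing such non-ASCII bytes (rather than the unicode string itself) to the selector is what the "All strings must be XML compatible" error is complaining about, but treat that as an assumption.

```python
# -*- coding: utf-8 -*-
# The same label, once with a \u escape and once typed literally.
escaped = u'T\u00edtulo Original:'
literal = u'Título Original:'

# Both spellings produce the same unicode string.
same_text = (escaped == literal)

# Encoding to UTF-8 turns the single character í into two bytes.
utf8_bytes = literal.encode('utf-8')

# The ASCII codec cannot represent í at all, so encoding fails.
try:
    literal.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

If that reading is right, keeping the expression as a unicode string and dropping the explicit .encode('utf-8') call may be worth trying first.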

Answer 2

Try adding the following line at the top of your Python file:

# -*- coding: utf-8 -*-

For a full explanation, read the docs.
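
As a rough illustration of the declaration in use, the sketch below puts it at the top of a file and then matches the accented label with plain unicode strings. It deliberately uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own. The fragment is a trimmed, wrapped copy of the markup in the question, and value_after_label is a hypothetical helper, not part of any library.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question, closed so it parses as XML.
FRAGMENT = u'''<div class="datosespectaculo">
<strong>Título Original:</strong> <em>15.361</em><br />
<strong>Año: </strong><span itemprop="copyrightYear">2014</span><br />
</div>'''


def value_after_label(div, label):
    """Return the text of the element right after <strong>label</strong>."""
    children = list(div)
    for index, child in enumerate(children[:-1]):
        # Compare unicode to unicode; strip the trailing space in "Año: ".
        if child.tag == 'strong' and (child.text or u'').strip() == label:
            return children[index + 1].text
    return None


# Parse from UTF-8 bytes, then look the values up by their Spanish labels.
root = ET.fromstring(FRAGMENT.encode('utf-8'))
titulo = value_after_label(root, u'Título Original:')
ano = value_after_label(root, u'Año:')
```

In Scrapy itself, the analogue would presumably be an XPath along the lines of u'.//strong[contains(text(), "Título Original:")]/following-sibling::em[1]/text()' passed as a unicode string. I have not run that against the page, so take it as a starting point rather than a tested fix.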

Scrapy selector strings do not accept international characters
