Question

I'm trying to use Scrapy to scrape this website: http://www.fs.fed.us/research/people/profile.php?alias=ggonzalez

This is the function that returns the final items I export from my spider:

def parse_post(self, response):
    theitems = []
    pubs = response.xpath("//div[@id='pubs']/ul/li/a")
    for i in pubs:
        item = FspeopleItem()
        name = str(response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract()).strip()
        pub = str(i.xpath("@title").extract()).strip() 
        item['link'] = response.url
        item['name'] = name
        item['pub'] = pub
        theitems.append(item)
    return theitems

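For some reason, the returned "theitems" always shows accented characters (such as the í in Díaz) as blank spaces. I can't figure out why. When I open a Scrapy shell and print the data from the xpath on its own, it prints to the console just fine, but when it comes out of the returned "theitems" it turns into a blank space. I've tested this in both Python 2.7 and 3.5.

I'm new to Scrapy, to coding in general, and to Python in general. Other than this encoding problem, though, everything works. Does anyone know why this is happening?

Thanks.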

/////// ////////

EDIT

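Thanks for the suggestions. The formatting is nicer now, because the u' stuff goes away when I use the code below (with .encode("utf-8") and .extract_first() when writing my items), but the accented characters still look funky. So I looked at the encoding of the website I'm scraping and saw that it uses ISO-8859-1. I then tried .encode("ISO-8859-1") when adding fields to the item, and when I open the .csv the accented characters display correctly (all the formatting is great). However, when I do this, about 25% of the sites are not scraped: the csv has ~1400 entries instead of ~2100. I can't figure out why it scrapes some sites and not others.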

import scrapy

from fspeople.items import FspeopleItem

class FSSpider(scrapy.Spider):
    name = "hola"
    allowed_domains = ["fs.fed.us"]
    start_urls = [
        "http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=SRS&state_id=ALL"]

    def __init__(self):
        self.i = 0

    def parse(self,response):
        for sel in response.xpath("//a[@title='Click to view their profile ...']/@href"):
            url = response.urljoin(sel.extract())
            yield scrapy.Request(url, callback=self.parse_post)
        self.i += 1

    def parse_post(self, response):
        theitems = []
        pubs = response.xpath("//div[@id='pubs']/ul/li")
        for i in pubs:
            item = FspeopleItem()
            name = response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract_first().strip().encode("ISO-8859-1")
            pubname = i.xpath("a/text()").extract_first().strip().encode("ISO-8859-1")
            pubauth = i.xpath("text()").extract_first().strip().encode("ISO-8859-1")

            item['link'] = response.url
            item['name'] = name
            item['pubname'] = pubname
            item['pubauth'] = pubauth
            theitems.append(item)
        return theitems

Answer 1

Use extract_first() and encode():

for i in pubs:
    item = FspeopleItem()
    name = response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract_first().strip().encode("utf-8")
    pub = i.xpath("@title").extract_first().strip().encode("utf-8") 
    item['link'] = response.url
    item['name'] = name
    item['pub'] = pub
    theitems.append(item)
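
A note on the choice of codec in the snippet above, offered as a sketch rather than a confirmed diagnosis. UTF-8 can represent every Unicode character, while a narrower codec such as ISO-8859-1 raises UnicodeEncodeError for any character it lacks. An exception like that inside a callback would stop that page from producing items, which may be one reason for the missing rows described in the question's EDIT. The helper name safe_encode and the sample strings are made up for illustration.

```python
# -*- coding: utf-8 -*-

def safe_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text.

    Hypothetical helper, not part of Scrapy.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


plain = u"D\u00edaz"            # the accented i exists in ISO-8859-1
fancy = u"Forest\u2019s edge"   # U+2019 (curly apostrophe) does not

latin_plain = safe_encode(plain, "ISO-8859-1")   # encodes fine
latin_fancy = safe_encode(fancy, "ISO-8859-1")   # None: not representable
utf8_fancy = safe_encode(fancy, "utf-8")         # UTF-8 always succeeds

print(latin_plain, latin_fancy, utf8_fancy)
```

If this is what is happening, the spider log should show UnicodeEncodeError tracebacks for the affected profile pages.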

Answer 2

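This is an encoding/decoding problem.

As Steve said, it may just be the software you are using to view the extracted data.

If that's not it, try removing the str() calls and see what happens. Or you could change them to unicode() [1]. I usually don't use either of them; I just let the fields be filled with whatever comes from response.xpath('...').extract().

Also, make sure everything in your project is UTF-8: the files your code is written in, the settings, and the strings. For example, never write this: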
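
To illustrate how the viewing software alone can produce the symptom, here is a minimal sketch. It assumes the data was written as UTF-8 and the viewer reads it as Latin-1; the sample name is taken from the question, and nothing here is specific to Scrapy.

```python
# -*- coding: utf-8 -*-

original = u"D\u00edaz"                 # "Diaz" with an accented i

# Bytes as they would sit in a UTF-8 encoded file.
stored = original.encode("utf-8")

# A viewer that assumes Latin-1 turns the two-byte sequence for the accented
# letter into two unrelated characters, one of them an invisible soft hyphen.
misread = stored.decode("latin-1")

# Reading the same bytes with the right codec gives the text back intact.
reread = stored.decode("utf-8")

print(repr(misread), repr(reread))
```

The data in the file is identical in both cases; only the codec chosen by the reader differs.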
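
A small sketch of why str() gets in the way. The list below is a stand-in for what response.xpath('...').extract() returns (a list of strings); the sample value is an assumption based on the name mentioned in the question.

```python
# -*- coding: utf-8 -*-

# Stand-in for the list returned by .extract().
extracted = [u"  D\u00edaz  "]

# str() on a list yields the list's repr, brackets and quotes included,
# and .strip() only trims whitespace at the very ends of that repr.
as_repr = str(extracted).strip()

# Taking the first element keeps it an ordinary text string,
# which is roughly what .extract_first() does for you.
first = extracted[0].strip() if extracted else None

print(as_repr)
print(first)
```

On Python 2 the repr also escapes the accented character, which is where the u'...' noise mentioned in the question comes from.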

item['name'] = 'First name: ' + name

Write this (unicode!):

item['name'] = u'First name: ' + name

Scrapy encoding problem in Python