Question

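I am using Scrapy to crawl a website.

My problem is that when I extract the href from the page, I get %20 in the URL. To remove it, I use split and get the URL I want.

For example:

Original URL: http://www.example.com/category/%20

My modified URL: http://www.example.com/category/

I pass the modified URL to Request, but the request is still made with the original URL instead of the modified one.

My parse and extract methods are as follows: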

def parse(self, response):
    sel = Selector(response)
    requests = []

    # Get Product Reviews
    for url in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a/@href').extract():
        url = url.encode('utf-8').split('%')[0]
        requests.append(Request(url, callback=self.extract))

    for request in requests:
        print request.url
        yield request

def extract(self, response):
    sel = Selector(response)
    requestedItem = ProductItem()
    requestedItem['name'] = sel.xpath('//*[@id="content-wrapper"]/div/div[1]/div[1]/div/div/h1/text()').extract()[0].encode('utf-8')
    requestedItem['description'] = sel.xpath('//*[@id="content-wrapper"]/div/div[1]/div[2]/div/div/div[1]/p/text()').extract()[0].encode('utf-8')

    yield requestedItem

Can anyone please help me solve this problem?

Answer 1

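See the following answer (and the related question): Scrapy: URL error, Program adds unnecessary characters(URL-codes)

As you can see, a space has been added to the URL. To handle it, you can apply normalize-space when selecting the URL, or simply strip the value before making the request.

This is because %20 is an encoded space. It is only escaped when the URL is requested, so the href you extract ends with a plain space rather than a literal %20, and splitting on '%' has nothing to remove.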
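To make the point concrete, here is a minimal sketch that does not need Scrapy at all. It uses `urllib.parse.quote` from the standard library only as a stand-in for the escaping that happens when the request URL is made safe; the variable names are made up for illustration, and the example URL is the one from the question with a trailing space assumed.

```python
from urllib.parse import quote

# Assumption: the extracted href ends with a plain space, not a literal "%20".
raw_href = "http://www.example.com/category/ "

# Percent-encoding turns the trailing space into %20 (stand-in for what
# happens when the URL is escaped for the request).
encoded = quote(raw_href, safe=":/")

# The raw string contains no '%', so splitting on '%' leaves it unchanged.
after_split = raw_href.split("%")[0]

# Stripping whitespace removes the space before any encoding takes place.
after_strip = raw_href.strip()

print(repr(encoded))
print(repr(after_split))
print(repr(after_strip))
```

The split version still carries the trailing space, which is why the request keeps going to the URL ending in %20.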

So instead of using

url = url.encode('utf-8').split('%')[0]

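you can normalize each link's href in XPath:

# normalize-space() on a node-set only uses its first node, so apply it per link
for link in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a[@href]'):
    url = link.xpath('normalize-space(@href)').extract()[0]
    requests.append(Request(url, callback=self.extract))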

or strip the value in Python:

for url in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a/@href').extract():
    requests.append(Request(url.strip(), callback=self.extract))
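If the same cleanup is needed in several callbacks, it could be pulled into a small helper. This is only a sketch: `clean_hrefs` is a hypothetical name, not part of Scrapy, and skipping empty values is an added assumption rather than something the question asked for.

```python
def clean_hrefs(hrefs):
    """Strip surrounding whitespace from each href and skip empty results."""
    cleaned = []
    for href in hrefs:
        href = href.strip()
        if href:  # assumption: whitespace-only hrefs are not worth requesting
            cleaned.append(href)
    return cleaned


# Plain strings stand in for the list returned by .extract().
sample = ["http://www.example.com/category/ ", "   ", "http://www.example.com/other/\n"]
print(clean_hrefs(sample))
```

In the spider, the loop would then iterate over `clean_hrefs(sel.xpath(...).extract())` and pass each URL to `Request` as before.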
