Question

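I'm building a Scrapy spider that takes an XPath query as an input argument.

The page I'm trying to scrape has carriage returns, line feeds, and other characters in the price text field, and I'm using the translate() function to remove them.

The selector works fine when it is hard-coded in the spider, but the translation doesn't work when the selector is passed as an argument.

Here is my spider: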

import scrapy
from spotlite.items import SpotliteItem


class GenericSpider(scrapy.Spider):
    name = "generic"
    xpath_string = ""

    def __init__(self, start_url, allowed_domains, xpath_string, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        # Values arrive as strings from the -a command-line arguments.
        self.start_urls = ['%s' % start_url]
        self.allowed_domains = ['%s' % allowed_domains]
        self.xpath_string = xpath_string

    def parse(self, response):
        self.logger.info('URL is %s', response.url)
        self.logger.info('xPath is %s', self.xpath_string)
        item = SpotliteItem()
        item['url'] = response.url
        # Evaluate the XPath expression supplied on the command line.
        item['price'] = response.xpath(self.xpath_string).extract()
        return item

I invoke the spider with the following command:

scrapy crawl generic -a start_url=https://www.danmurphys.com.au/product/DM_4034/penfolds-kalimna-bin-28-shiraz -a allowed_domains=danmurphys.com.au -a "xpath_string=translate((//span[@class='price'])[1]/text(),',$\r\n\t','')"

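The problem seems to be with passing the special characters, i.e. \r\n\t, in the argument.

The '$' character is removed correctly, but the \r\n\t characters are not, as the following output shows: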

{'price': [u'\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t27.50\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t'],
 'url': 'https://www.danmurphys.com.au/product/DM_4034/penfolds-kalimna-bin-28-shiraz.jsp;jsessionid=B0211294F13A980CA41261379CD83541.ncdlmorasp1301?bmUID=loERXI6'}

Any help or advice would be greatly appreciated!

Thanks,

Michael

Answer 1

Try using the normalize-space() XPath function in your selector:

scrapy crawl generic -a start_url=<URL> -a \
    allowed_domains=danmurphys.com.au \
    -a "xpath_string=normalize-space(//span[@class='price'][1]/text())"
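To illustrate what normalize-space() does to the price text, here is a rough pure-Python approximation that runs without Scrapy. The sample string is an assumed, shortened copy of the text node from the question, and normalize_space is a hypothetical helper, not part of Scrapy or any XPath library.

```python
import re


def normalize_space(text):
    """Approximate XPath 1.0 normalize-space(): collapse runs of space,
    tab, CR and LF into one space, then trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


# Assumed sample: a shortened version of the price text node from the question.
raw_price = "\r\n\t\t\t$27.50\r\n\t\t"

print(repr(normalize_space(raw_price)))  # '$27.50'
```

Because the whitespace is handled by the function itself, no \r, \n, or \t has to be passed on the command line at all.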

In the parse method, you can use extract_first() to get the price as a single string rather than a list:

item['price'] = response.xpath(self.xpath_string).extract_first()

You can also use the re_first() method to strip the $ sign from the string:

item['price'] = response.xpath(self.xpath_string).re_first(r"\$(.+)")
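As for why the original translate() call left the whitespace in place: inside double quotes the shell does not interpret \r, \n, or \t, so the spider most likely receives each one as a backslash followed by a letter. translate() then removes those literal characters rather than the control characters. The sketch below mimics this in plain Python, without Scrapy. raw_price is an assumed, shortened copy of the text node from the question, and xpath_translate_remove and decode_escapes are hypothetical helpers, not Scrapy or XPath APIs.

```python
# Assumed sample: a shortened version of the price text node from the question.
raw_price = "\r\n\t\t$27.50\r\n\t"

# Second argument of translate() as delivered by the shell inside double
# quotes: backslash-r, backslash-n and backslash-t stay two characters each.
chars_from_shell = r",$\r\n\t"


def xpath_translate_remove(text, chars):
    """Mimic XPath translate(text, chars, ''): drop every character in chars."""
    return text.translate({ord(c): None for c in chars})


def decode_escapes(text):
    """Turn the literal sequences \\r, \\n and \\t into real control characters."""
    for literal, control in ((r"\r", "\r"), (r"\n", "\n"), (r"\t", "\t")):
        text = text.replace(literal, control)
    return text


# Literal backslash sequences: only '$' is removed, the whitespace survives.
unchanged = xpath_translate_remove(raw_price, chars_from_shell)
# Decoded sequences: '$' and the control characters are all removed.
cleaned = xpath_translate_remove(raw_price, decode_escapes(chars_from_shell))

print(repr(unchanged))  # '\r\n\t\t27.50\r\n\t'
print(repr(cleaned))    # '27.50'
```

If translate() is preferred over normalize-space(), one option under the same assumption would be to run something like decode_escapes() on xpath_string in the spider's __init__ before storing it.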

Passing an XPath translate() function to Scrapy doesn't work for special characters
