Question

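I'm still learning Scrapy and am trying to scrape some information from this page: Schlotzskys store

However, after loading the page in scrapy shell, I ran into problems parsing it, particularly the address shown on the site.

First, I ran the following command to open the shell: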

pipenv run scrapy shell https://www.schlotzskys.com/find-your-schlotzskys/arkansas/fayetteville/2146/

That went fine. Then I tried to grab the address, using the following:

response.css('div.col-xs-12 col-sm-6 col-md-6')
response.css('div.container locations-mid-container')
response.xpath('//div[@class="locations-info"]')
response.css('div.locations-address')

The first two inputs above return:

[]

The last two inputs return:

<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' locations-address ')]/text()" data='\n\t\t\t\t\t131 N. McPherson Church Rd.\t\t\t\t'>

or variations thereof.

Then I looked at the raw HTML with:

print(response.text)

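The HTML I'm interested in does appear in the output, but Scrapy doesn't seem to parse it. It looks like the HTML may be broken, and I'm wondering whether there is any workaround.

I'd really appreciate any help!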

Answer 1

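I can't find any element on the page that matches the CSS selector in your first expression. All of your expressions are also missing an extract() or extract_first() call, so what you get back are Selector objects rather than strings.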

Try this:

address = [
    response.xpath('normalize-space(//div[@class="locations-address"])').extract_first(),
    response.xpath('normalize-space(//div[@class="locations-address-secondary"])').extract_first(),
    response.xpath('normalize-space(//div[@class="locations-state-city-zip"])').extract_first()
]

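The normalize-space() XPath function strips the annoying leading and trailing whitespace and collapses any internal runs of whitespace into a single space.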
