Question

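I'm trying to learn Scrapy for Python using the following guide: http://brutalsimplicity.github.io/2016/07/25/scrapy.html. I've followed the instructions and can scrape some data manually through the scrapy shell, but I'm having trouble targeting the elements I want. When I try to target the following div: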

<div class="overthrow table_container" id="div_pbp">

using @class:

response.xpath('//div[@class="overthrow table_container"]')

it works, and I get

[<Selector xpath='//div[@class="overthrow table_container"]' data=u'<div class="overthrow table_container" i'>]

in response. But when I try the same thing with @id:

response.xpath('//div[@id="div_pbp"]')

I get empty brackets [] in response.

Edit1: I'm using Windows 10, Python 2.7.13, and Scrapy 1.4.0. Did I formulate my query incorrectly, or is something else going on?

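Edit2: I noticed that the output in the scrapy shell is cut off. When I use //div to view all divs, I get the following output: [image]
Could this be the problem? How do you tell Scrapy to return the whole selector instead of cutting it off?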

Edit3: Another example:
Using the developer tools on the website, I see that the selector should be:
<Selector xpath='//div[@id="all_game_info"]' data=u'<div id="all_game_info" class="table_wrapper columns'>

When I access it with:

response.xpath('//div[@id="all_game_info"]')

then I get:

[<Selector xpath='//div[@id="all_game_info"]' data=u'<div id="all_game_info" class="table_wra'>]

So part of it is cut off. When I now try to search by the class attribute, using something like

response.xpath('//div[@class="table_wra"]')

or

response.xpath('//div[@class="table_wrapper columns"]')

then I get empty brackets [].

By the way: this is all in the Scrapy shell.

Answer 1

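I eventually worked out what is happening, and I was able to reproduce the problem you're facing. I use neither the Scrapy shell nor IPython, so what you see will look slightly different.

I used the requests library to fetch the page content, then wrapped it in a Scrapy HtmlResponse object so that I could search the page with XPath expressions.

My experience was the same as yours.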

>>> url = 'http://www.pro-football-reference.com/boxscores/201409040sea.htm'
>>> from scrapy.http import HtmlResponse
>>> import requests
>>> page = requests.get(url).content
>>> response = HtmlResponse(url,body=page)
>>> response.xpath('//div[@class="overthrow table_container"]')
[<Selector xpath='//div[@class="overthrow table_container"]' data='<div class="overthrow table_container" i'>]
>>> response.xpath('//div[@id="div_pbp"]')
[]

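So I looked in the HTML for div_pbp. I shouldn't have been surprised, because this happens quite often: the element we're looking for is inside an HTML comment, and it occurs only once in the page. Markup inside a comment is not parsed into elements, so an XPath query for the div's @id finds nothing.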
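A minimal sketch of one way around this, assuming Python 3 and only the standard library: remove the comment delimiters from the raw HTML before parsing, so that the hidden markup becomes ordinary elements. The sample HTML, the id `div_other`, and the helper names `DivIdCollector`, `div_ids` and `uncomment` are made up for illustration; they are not taken from the real page or from Scrapy.

```python
import re
from html.parser import HTMLParser

# Hypothetical miniature of the page: the div we want sits inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div class="overthrow table_container" id="div_other">visible</div>
<!--
<div class="overthrow table_container" id="div_pbp">hidden in a comment</div>
-->
</body></html>
"""


class DivIdCollector(HTMLParser):
    """Collect the id attribute of every <div> the parser sees as a real tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.ids.append(dict(attrs).get("id"))


def div_ids(html):
    """Return the ids of all <div> start tags found in the given HTML string."""
    collector = DivIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


def uncomment(html):
    """Remove the comment delimiters but keep the markup between them."""
    return re.sub(r"<!--|-->", "", html)


ids_before = div_ids(SAMPLE_HTML)             # commented-out div is invisible
ids_after = div_ids(uncomment(SAMPLE_HTML))   # now it is parsed as a tag
print(ids_before)
print(ids_after)
```

With Scrapy, the uncommented string could then be wrapped for XPath queries, for example with `Selector(text=...)` from `scrapy.selector`. Stripping every comment delimiter is a blunt tool and may expose markup the site deliberately disabled, so it is worth checking the result.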

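Edit: If the output of a statement is text and there is a lot of it, one strategy is to bind that output to a Python name and then write it to a file for inspection. Like this: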

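enormousOutput = <statement>  # placeholder: any expression that returns a large string
with open('temp.txt', 'w') as f:  # the file is closed automatically
    f.write(enormousOutput)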
