Question

I just started using Scrapy, and things had been going well until this problem. I can't seem to find the standings table on this page:

http://www.baseball-reference.com/leagues/MLB/2016-standings.shtml#all_expanded_standings_overall

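The table has the ID '#expanded_standings_overall', but I can't find it with either the spider or the shell. I am able to get a result for #all_expanded_standings_overall, because there is a div with that ID. Extracting that div in the shell shows me the table I want, but even within that content I can't locate the table with '#body', 'tr', or anything else I've tried.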

Answer 1

If you look at the page source, you'll see that the element with the ID in question (expanded_standings_overall) sits inside an HTML comment:

<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <div class="overthrow table_container" id="div_expanded_standings_overall">
            <table class="sortable stats_table" id="expanded_standings_overall" data-cols-to-freeze=2>
                <caption>MLB Detailed Standings</caption>
                    ... sweet data here ..
                </table>
        </div>
    </div>
-->
</div>

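The HTML comment appears to be a trick for hiding content from us innocent scrapers ;)

Interestingly, Firebug does not show this comment...?

One way around this is to extract the comments, strip the comment markers, and then parse the markup that was inside them. For example: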

$ scrapy shell www.baseball-reference.com/leagues/MLB/2016-standings.shtml
>>> view(response)
>>> from scrapy.selector import Selector
>>> sel = Selector(response)
>>> sel.xpath('//table[@id="expanded_standings_overall"]')
[]
>>> import re
>>> regex = re.compile(r'<!--(.*)-->', re.DOTALL)
>>> for comment in sel.xpath('//comment()').re(regex):
...     table = Selector(text=comment).xpath('//table[@id="expanded_standings_overall"]')
...     print(table)
...
[]
[]
[<Selector xpath='//table[@id="expanded_standings_overall"]' data='<table class="sortable stats_table" id="'>]
[]
[]

As you can see, I prefer XPath selectors to CSS, but the two work the same way in principle; see https://doc.scrapy.org/en/latest/topics/selectors.html.
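
An alternative to looping over every comment is to unwrap all comments in the page text once and then query the result as ordinary HTML. The following is only a minimal sketch of that idea, not part of the original answer: the names uncomment_html, find_table_ids, TableIdCollector and SAMPLE are hypothetical, the sample markup is a shortened copy of the snippet above, and the standard library's HTMLParser stands in for Scrapy's selector so the example runs without Scrapy installed.

```python
import re
from html.parser import HTMLParser

# Non-greedy so each comment is unwrapped on its own;
# DOTALL lets "." match the newlines inside a multi-line comment.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def uncomment_html(html):
    """Return html with the comment markers removed but the commented-out markup kept."""
    return COMMENT_RE.sub(lambda match: match.group(1), html)


class TableIdCollector(HTMLParser):
    """Records the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def find_table_ids(html):
    """Parse html and return the ids of all tables outside of comments."""
    collector = TableIdCollector()
    collector.feed(html)
    collector.close()
    return collector.table_ids


# Shortened, hypothetical stand-in for the real page source.
SAMPLE = """
<div id="all_expanded_standings_overall">
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <table class="sortable stats_table" id="expanded_standings_overall">
            <caption>MLB Detailed Standings</caption>
        </table>
    </div>
-->
</div>
"""

# The parser ignores markup inside comments, so the table is invisible at first.
print(find_table_ids(SAMPLE))
# After unwrapping the comments, the table is part of the regular document.
print(find_table_ids(uncomment_html(SAMPLE)))
```

In a Scrapy shell or spider, the same helper could presumably be combined with the selector from the session above, for example Selector(text=uncomment_html(response.text)).xpath('//table[@id="expanded_standings_overall"]'). Note that this unwraps every comment on the page, including ones that were never meant to be markup, so the per-comment loop may be the safer choice on pages with many unrelated comments.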
