Question

I am trying to scrape the statistics from this URL:

http://www.acb.com/redaccion.php?id=133495

I started by trying to get the player names:

import scrapy
import requests
from scrapy.item import Item, Field
from ligafemanager.items import LigafemanagerItem

class Lf1Spider(scrapy.Spider):
    name = 'lf1'
    allowed_domains = ['acb.com']
    start_urls = ['http://www.acb.com/redaccion.php?id=133495']

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        i = LigafemanagerItem()
        i['acb_player_name'] = response.xpath('//td/div/codigo/table[1]/tbody/tr/td[2]/font/text()').extract()
        self.logger.info('------------ACB NAME is: %s ------',
                         i['acb_player_name'])
        return i

It never returns any results.

Answer 1

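This is a tricky one, because what you see in the browser is not what the server actually sent. Consider the HTML as shown in Firebug.

Now look at View Source for the same page.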

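Everything highlighted in red is markup that the Firefox view-source window flags as erroneous. Also notice that one key thing is missing: tbody. This happens on many websites. The HTML does not use tbody, but the browser autocorrects the markup and adds tbody so that the table displays properly.

When you run a script, tbody is not in the source, because Scrapy does not do any autocorrection, so your XPath with tbody will not find the element you are interested in. So what is the simple solution? Remove tbody from your XPath: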

In [3]: response.xpath('//td/div/codigo/table[1]/tr/td[2]/font/text()').extract()
Out[3]: ['Nombre']
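
The effect can be reproduced without Scrapy. The following is a minimal sketch, not the page's real markup: RAW_HTML is a made-up table without tbody, the standard library's xml.etree.ElementTree stands in for Scrapy's selectors, and strip_tbody is a hypothetical helper for cleaning up XPaths copied from browser developer tools.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical raw markup as a server might send it: the table has no <tbody>.
RAW_HTML = """
<html><body>
  <table>
    <tr><td>1</td><td>Nombre</td></tr>
  </table>
</body></html>
""".strip()


def strip_tbody(xpath):
    """Drop tbody steps (hypothetical helper) from an XPath copied from a browser."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


root = ET.fromstring(RAW_HTML)

browser_xpath = ".//table/tbody/tr/td[2]"   # what the browser's DOM suggests
source_xpath = strip_tbody(browser_xpath)   # ".//table/tr/td[2]"

# The raw source has no tbody, so the first query matches nothing.
with_tbody = [td.text for td in root.findall(browser_xpath)]
without_tbody = [td.text for td in root.findall(source_xpath)]

print(with_tbody)     # []
print(without_tbody)  # ['Nombre']
```

Note that this helper removes tbody unconditionally, so it would break an XPath for a page whose source really does contain tbody. Checking the raw response (for example with View Source) is the safer first step.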

Scrapy does not return results with XPath
