Question

I want to scrape the site 118.69.35.146/sjc/ to test the Scrapy framework. I am using HtmlXPathSelector to select the data; the code snippet for this task is:

def parse(self, response):
    sel = HtmlXPathSelector(response)
    sites = sel.select('//table[@id="grv_GiaVangUpdate"]/tr')
    items = []
    for site in sites:
        item = FinanceItem()
        item['buy'] = site.select('//td[3]/text()').extract()
        item['sell'] = site.select('//td[4]/text()').extract()
        items.append(item)
    return items

I want to get the text values between <td> and </td>.

But in the JSON output file all I get is 16 nodes with empty values:

[{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []},
{"sell": [], "buy": []}]

Could someone please check this code and tell me where I went wrong?

Thanks in advance!

Answer 1

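Looking at the page source and testing the query in FirePath, I found that it should work fine.

Make sure you are actually getting the same web page: add a pdb/ipdb breakpoint (import ipdb; ipdb.set_trace()) right after sel = HtmlXPathSelector(response) and look at what is inside response. Then step through the program to see where and why it fails.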

Answer 2

You should always test your XPaths in the Scrapy shell (scrapy shell 'http://118.69.35.146/sjc/'), not just in other tools.

For this site, Firefox shows something like <td align="center">34,870</td> for a given element, while Scrapy sees <td align="center"><font face="Times New Roman" color="Black" size="3">34,870</font></td> for the same element. So you want '//td[3]/font/text()' or, better, '//td[3]//text()'.

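But you will run into another problem: when you call site.select('//td[3]/text()').extract(), you are searching the whole tree, not just inside '//table[@id="grv_GiaVangUpdate"]/tr', which I guess is what you want. You should use './/td[3]/text()', with a dot at the start (or './/td[3]//text()' once the <font> fix above is applied).

Note: select is deprecated; use xpath() instead.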

How do I get data with HtmlXPathSelector in Scrapy?
