Question

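I am trying to learn how to scrape a web page (http://www.expressobeans.com/public/detail.php/185246), but I can't tell what I'm doing wrong. I think it has to do with identifying the XPath, but how do I get the correct path (if that is the problem)? I have tried Firebug in Firefox and the developer tools in Chrome.

I want to be able to scrape the manufacturer value (D&L Screenprinting) as well as all of the edition details.

Python script: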

from lxml import html
import requests

page = requests.get('http://www.expressobeans.com/public/detail.php/185246')

tree = html.fromstring(page.text)

buyers = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/dl/dd[3]')

print(buyers)

Returns:

[]

Answer 1

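Remove tbody from the XPath. Browser developer tools show the DOM after the browser has inserted implicit tbody elements, and those elements are usually absent from the raw HTML that lxml parses.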

buyers = tree.xpath('//*[@id="content"]/table/tr[2]/td/table/tr/td[1]/dl/dd[3]')
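
To see why the tbody step matters, here is a minimal sketch that runs against an inline snippet instead of the live page. The RAW_HTML string is a made-up stand-in (an assumption, not the real page source); the point is only that lxml matches the markup as written and does not add the tbody that a browser would show.

```python
from lxml import html

# Hypothetical stand-in for the raw page source: note there is no <tbody> tag.
RAW_HTML = """
<div id="content">
  <table>
    <tr><td>header</td></tr>
    <tr><td>value</td></tr>
  </table>
</div>
"""

tree = html.fromstring(RAW_HTML)

# Path as copied from browser dev tools, including the tbody the browser inserted.
with_tbody = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/text()')

# Same path with tbody removed, matching the markup lxml actually parsed.
without_tbody = tree.xpath('//*[@id="content"]/table/tr[2]/td/text()')

print(with_tbody)
print(without_tbody)
```

If a page's source really does contain tbody tags, the first form would be the right one, so it is worth checking the raw source (for example with "View Source") rather than the inspector.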

Answer 2

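I would first suggest looking at the page's HTML and finding a node closer to the value you are after, then building the path from there so that it is shorter and easier to understand.

In that page I can see a "dl" with class "itemListingInfo", and all the information you are looking for sits under it.

Also, if you want the "D&L Screenprinting" text, you need to extract the text from the link.

Try this modified version; adding further XPath expressions to get the other fields should be straightforward.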

from lxml import html
import requests

page = requests.get('http://www.expressobeans.com/public/detail.php/185246')

tree = html.fromstring(page.text)

buyers = tree.xpath('//dl[@class="itemListingInfo"]/dd[2]/a/text()')

print(buyers)
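
For the "all edition details" part of the question, one possible approach is to walk the dt/dd pairs inside that dl and collect them into a dictionary. This is only a sketch: the SAMPLE markup and its field names are assumptions made for illustration, and the real page may be structured differently.

```python
from lxml import html

# Hypothetical markup loosely modeled on the description above; the real page may differ.
SAMPLE = """
<dl class="itemListingInfo">
  <dt>Artist</dt><dd><a href="#">Some Artist</a></dd>
  <dt>Manufacturer</dt><dd><a href="#">D&amp;L Screenprinting</a></dd>
  <dt>Edition</dt><dd>300</dd>
</dl>
"""

tree = html.fromstring(SAMPLE)
dl = tree.xpath('//dl[@class="itemListingInfo"]')[0]

details = {}
for dt in dl.xpath('./dt'):
    dd = dt.getnext()  # the element that immediately follows this <dt>
    if dd is not None and dd.tag == 'dd':
        # text_content() also picks up text nested inside links
        details[dt.text_content().strip()] = dd.text_content().strip()

print(details)
```

Pairing labels with values this way avoids hard-coding positions such as dd[2], which would silently break if the page added or reordered a field.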
