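I am using Python with Scrapy.
I am trying to read XPath expressions from an XML file:
def getMasterContainers(self):
    # Collect the text of every <xpath> element under MasterPage/Containers
    containers = []
    containersFromXML = self.doc.findall('MasterPage/Containers/xpath')
    for oneXpath in containersFromXML:
        containers.append(oneXpath.text)
    return containers
The XML file is: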
<Containers>
<xpath>''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''</xpath>
</Containers>
When I print the result at the command prompt, I get this:
container = ''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''
When I try sel.xpath(self.containers[0]), I get no results, but when I write the XPath directly in the code, like this:
sel.xpath('xpath written by hand')
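I get the correct data.
Please help.
Answer 0 (score: 2)
Update: Are you sure the trouble is with this XPath? Have you confirmed that nothing fails earlier or later than this XPath? I wasn't sure how you are scraping with Scrapy, so I ran the XML parsing by hand and then ran the following against both a test document and the real document. It works for me.
first.xml contains only the xpath and its parent structure: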
<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>
Parsing first.xml:
from lxml import etree

doc = etree.parse('first.xml')
containers = []
containersFromXML = doc.findall('MasterPage/Containers/xpath')
for oneXpath in containersFromXML:
    # Print each XPath as loaded, then keep it for the search step
    print(oneXpath.text)
    containers.append(oneXpath.text)
Output:
.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Looks good.
test.html is:
<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>
Searching it with:
from scrapy.selector import Selector

sel = Selector(text=open('test.html').read())
for container in containers:
    print("Xpath: {}".format(container))
    result = sel.xpath(container)
    # Number of nodes matched by this XPath
    print("Container: {}".format(len(result)))
    for elem in result:
        print(elem)
Output:
Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 2
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found A</div>'>
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found B</div>'>
Output of the same search against the real page, downloaded with wget:
Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 25
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n \n '>
# omitted 23
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n \n '>
It looks like your XPath string has extra single quotes (') around it that should not be part of the expression. In the XML it looks like:
<xpath>''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''</xpath>
And once parsed (as shown when you print it):
''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''
You don't want the surrounding ' characters. It should be:
.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]
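A quick way to confirm this kind of mismatch is to compare the loaded string with the hand-written one and look at its repr(), which makes stray quotes visible. This is only a minimal sketch: the two strings are copied from the question, and the variable names from_xml and hand_written are made up for illustration.

```python
# XPath text as loaded from the XML file in the question:
# it is wrapped in two single quotes on each side.
from_xml = """''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''"""

# The same XPath as it would be typed by hand in the spider code.
hand_written = './/div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]'

# repr() shows the exact characters, including the surrounding quotes.
print(repr(from_xml))

# The two strings differ only by the surrounding single quotes.
print(from_xml == hand_written)
print(from_xml.strip("'") == hand_written)
```

If the first comparison is False and the second is True, the surrounding quotes are the only difference between the two expressions.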
If you can edit the XML file that contains the XPaths, remove the leading '' and the trailing '' from each <xpath> element. So:
<Containers>
<xpath>''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''</xpath>
</Containers>
should become:
<Containers>
<xpath>.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]</xpath>
</Containers>
But if for some reason you cannot edit the XML file, strip the surrounding ' characters from the XPath text after reading it. So:
containers.append(oneXpath.text)
should become:
containers.append(oneXpath.text.strip("'"))
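Putting the strip into the loader, a slightly more defensive version might look like the sketch below. It rests on a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, it parses the XML from an inline string rather than from a file, and load_xpaths is a hypothetical helper name, not something from your code. It also skips empty <xpath> elements and trims surrounding whitespace, which the original code does not do.

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML structure from the question, quotes included.
XML = """<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>"""


def load_xpaths(root):
    """Return the cleaned text of every MasterPage/Containers/xpath element."""
    xpaths = []
    for node in root.findall('MasterPage/Containers/xpath'):
        # node.text is None for an empty element, so fall back to ''.
        # Trim whitespace first, then the surrounding single quotes.
        text = (node.text or '').strip().strip("'")
        if text:
            xpaths.append(text)
    return xpaths


root = ET.fromstring(XML)
containers = load_xpaths(root)
print(containers[0])
```

The cleaned strings in containers can then be passed to sel.xpath() exactly as in the search loop above.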