Reading data from XML in Python

Time: 2014-02-14 22:09:52

Tags: python xml python-2.7 xpath scrapy

I am using Python with Scrapy.

I am trying to read XPath expressions from an XML file:

def getMasterContainers(self):
    containers=[]
    containersFromXML = self.doc.findall('MasterPage/Containers/xpath')
    for oneXpath in containersFromXML:
        containers.append(oneXpath.text)
    return containers

The XML file is:

<Containers>
  <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>

When I print the result in cmd, I get this:

container = ''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

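My problem

When I try sel.xpath(self.containers[0]), I get no results, but when I write the XPath directly in the code, like sel.xpath('xpath written by hand'), I get the correct data.

Please help.

1 answer:

Answer 0: (score: 2)

Update: Are you sure the trouble is with this XPath? Have you confirmed that nothing fails before or after it? I am not sure how you are scraping with Scrapy, so I ran the XML parsing by hand and then ran the following against both a test document and the real document. Both worked for me.

first.xml contains only the xpath element and its parent structure: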

<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>

Parsing first.xml:

from lxml import etree

doc = etree.parse(open('first.xml'))

containers = []
containersFromXML = doc.findall('MasterPage/Containers/xpath')
for oneXpath in containersFromXML:
    print oneXpath.text
    containers.append(oneXpath.text)

Output:

.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']

Looks good.

test.html is:

<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>

Searched with:

from scrapy.selector import Selector

sel = Selector(text=open('test.html').read())
for container in containers:
    print "Xpath: {}".format(container)
    result = sel.xpath(container)
    print "Container: {}".format(len(result))
    for elem in result:
        print elem

Output:

Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 2
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found A</div>'>
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found B</div>'>
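As a cross-check that does not depend on Scrapy, the same expression can be tried with the standard library. This is only a sketch under two assumptions: test.html happens to be well-formed XML, and the expression falls within the limited XPath subset that xml.etree.ElementTree supports. It swaps in ElementTree for Scrapy's Selector, so it is not what the code above does. The names TEST_HTML, XPATH, matches and texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same content as test.html above, inlined so the sketch is self-contained.
TEST_HTML = """<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>"""

# The XPath exactly as stored in first.xml (no surrounding quotes).
XPATH = ".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']"

root = ET.fromstring(TEST_HTML)
matches = root.findall(XPATH)
texts = [elem.text for elem in matches]
print(len(matches), texts)  # 2 ['Found A', 'Found B']
```

Real-world HTML is rarely well-formed XML, so for actual pages Scrapy's Selector (or lxml) remains the appropriate tool.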

Results of searching the real URL's page (fetched with wget):

Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 25
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n        \n    '>
# omitted 23
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n        \n    '>

It looks like your XPath string has extra single quotes (') around it that should not be part of the expression. In the XML it looks like:

<xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>

When parsed (as your printed output shows), this becomes:

''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

You don't want the surrounding ' characters. It should be:

.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]
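Stray quotes like these are easy to miss in plain print output. A small check along the following lines might help spot them when loading expressions from a file. It is a sketch only: has_wrapping_quotes is a hypothetical helper, not part of Scrapy or lxml, and raw simply reproduces the parsed value shown above.

```python
def has_wrapping_quotes(xpath):
    """Return True if the string starts or ends with a quote character."""
    quotes = ("'", '"')
    return xpath[:1] in quotes or xpath[-1:] in quotes

# The value as it comes out of the XML file, including the stray quotes.
raw = "''.//div[@id=\"results-list\"]/div[@class=\"item paid-featured-item\"]/div[@class=\"listing-item\"]''"

# repr() shows the exact characters, which makes stray quotes visible.
print(repr(raw))

print(has_wrapping_quotes(raw))             # True
print(has_wrapping_quotes(raw.strip("'")))  # False
```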

If you can edit the XML file that contains the XPaths, remove the leading '&apos; and the trailing &apos;' from each <xpath> element. So:

<Containers>
  <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>

should become:

<Containers>
  <xpath>.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]</xpath>
</Containers>

But if for some reason you cannot edit the XML file, strip the surrounding ' characters after reading the XPath text. So:

containers.append(oneXpath.text)

should become:

containers.append(oneXpath.text.strip("'"))
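Putting the pieces together, a loader that tolerates the quoted form might look roughly like this. It is a sketch under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, CONFIG_XML and load_container_xpaths are hypothetical names, and the extra whitespace strip is an addition not in the original code.

```python
import xml.etree.ElementTree as ET

# The asker's XML (with the stray quotes), wrapped in the parent
# structure from first.xml so the findall path matches.
CONFIG_XML = """<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>"""

def load_container_xpaths(xml_text):
    """Return the XPath strings under MasterPage/Containers, without wrapping quotes."""
    root = ET.fromstring(xml_text)
    xpaths = []
    for node in root.findall('MasterPage/Containers/xpath'):
        raw = node.text or ''
        # Drop surrounding whitespace first, then any wrapping single quotes.
        xpaths.append(raw.strip().strip("'"))
    return xpaths

containers = load_container_xpaths(CONFIG_XML)
print(containers[0])
# .//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]
```

Note that strip("'") removes every leading and trailing single quote, so both the literal ' and the one produced by &apos; are dropped, while quotes inside the expression are left alone.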