Extracting a value between tags with XPath in Python

Time: 2014-09-28 18:39:28

Tags: python html xpath html-parsing lxml

I want to extract the value indicated in the image below.

Here is what I have tried:

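import requests
from lxml import html

url = 'http://site.ir'
content = requests.get(url).content
tree = html.fromstring(content)
# '????' marks the part of the expression I do not know how to write
print([e.text_content() for e in tree.xpath('//div[@class="grouptext"]/????')])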

The value is neither in the span tag nor in the br tag.

Image: (not reproduced here)

Update

Imagine I have:

out = """
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">
            span tag contents
        </span>
        WHAT I WANT
        <br></br>
    </div>
</div>
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">
            span tag contents
        </span>
        WHAT I WANT(1)
        <br></br>
    </div>
</div>
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">
            span tag contents
        </span>
        WHAT I WANT(2)
        <br></br>
    </div>
</div>
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">
            span tag contents
        </span>
        WHAT I WANT(3)
        <br></br>
    </div>
</div>
"""

2 Answers:

Answer 0 (score: 1)

Another option is to select the text node that follows the span as a sibling:

//div[@class="grouptext"]/span[1]/following-sibling::text()

Demo:

from lxml import html

data = """
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">
            span tag contents
        </span>
        WHAT I WANT
        <br></br>
    </div>
</div>
"""

tree = html.fromstring(data)
# take the first text node after the span and trim the surrounding whitespace
print(tree.xpath('//div[@class="grouptext"]/span[1]/following-sibling::text()')[0].strip())

Prints:

WHAT I WANT

For the updated example, here is what worked for me:

# tree is assumed to be parsed from the updated markup, e.g. tree = html.fromstring(out)
for result in tree.xpath('//div[@class="grouptext"]/span/following-sibling::text()'):
    print(result.strip())

Prints:

WHAT I WANT

WHAT I WANT(1)

WHAT I WANT(2)

WHAT I WANT(3)
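
The loop above may also pick up whitespace-only text nodes (for instance the line break after the br element), which become empty strings once stripped. A minimal sketch, assuming markup shaped like the question's update, that filters them inside the XPath with a `[normalize-space()]` predicate. The names `out`, `tree`, and `values` are only illustrative, and the sample is shortened to two blocks:

```python
from lxml import html

# Sample markup modelled on the question's update (two blocks only).
out = """
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">span tag contents</span>
        WHAT I WANT
        <br>
    </div>
</div>
<div class="groupinfo">
    <div class="grouptext">
        <span style="color:#5f0101">span tag contents</span>
        WHAT I WANT(1)
        <br>
    </div>
</div>
"""

tree = html.fromstring(out)

# [normalize-space()] keeps only text nodes that contain
# at least one non-whitespace character.
values = [
    text.strip()
    for text in tree.xpath(
        '//div[@class="grouptext"]/span/following-sibling::text()[normalize-space()]'
    )
]
print(values)
```

Collecting the results into a list first makes it easy to reuse the values instead of only printing them.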

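Answer 1 (score: 0)

It looks like this is the text content of the div element. Unfortunately, the value you want is not readable in the image, because you scribbled "what I want" over it.

What you are (most likely) looking for is a text node. It is not really "between tags"; it is a child node of the div[@class="grouptext"] element. There may be several such text nodes as children of this div.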
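
In lxml, such text nodes are not separate objects: text that follows a child element is stored on that element's `.tail` attribute. A small sketch on a simplified, whitespace-free version of the markup (the `snippet` string, including the "after br" text, is made up for illustration):

```python
from lxml import html

# Simplified markup; "after br" is added only to show a second text node.
snippet = (
    '<div class="grouptext">'
    '<span>span tag contents</span>WHAT I WANT<br>after br'
    '</div>'
)

div = html.fromstring(snippet)
span = div.find('span')

# Text inside the span belongs to the span itself.
print(span.text)

# Text right after </span> is the span's tail, a child text node of the div.
print(span.tail)

# XPath text() lists every text node that is a direct child of the div.
print(div.xpath('text()'))
```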

Try:

print([e.text_content() for e in tree.xpath('//div[@class="grouptext"]')])

Or

print(tree.xpath('//div[@class="grouptext"]/text()'))

may also work, but I am not very familiar with Python.
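
The two suggestions are not equivalent. `text_content()` returns all text inside the div, including the span's text, while `/text()` selects only the text nodes that are direct children of the div. A minimal sketch on a simplified snippet (assumed markup, not the real page):

```python
from lxml import html

# Simplified, whitespace-free stand-in for one block of the page.
snippet = (
    '<div class="grouptext">'
    '<span>span tag contents</span>WHAT I WANT<br>'
    '</div>'
)
tree = html.fromstring(snippet)

# text_content() concatenates the text of the div and all its descendants.
all_text = [e.text_content() for e in tree.xpath('//div[@class="grouptext"]')]

# /text() returns only the div's own child text nodes, skipping the span.
own_text = tree.xpath('//div[@class="grouptext"]/text()')

print(all_text)
print(own_text)
```

On real pages with indentation, both results would also carry surrounding whitespace, so stripping each string is usually still needed.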