How do I join strings separated by <br> using XPath in Python?

Asked: 2015-11-20 11:28:57

Tags: python xpath lxml

I'm trying to join the strings on either side of a <br> tag, but I can't get it to work.

Here is the code:

<li class="attr">
    <span>
        Size:L
        <br>
        Color:RED
    </span>
</li>

I tried this, but it doesn't work:

color_and_size = row.xpath('.//li[@class="attr"][1]/span[1]/text()')[0]

2 Answers:

Answer 0: (score: 1)

It seems your XML structure is broken, because the <br> tag is never closed. So if you are using lxml, try BeautifulSoup's soupparser, or use standalone BeautifulSoup as below:

from bs4 import BeautifulSoup
s = """<li class="attr">
    <span>
        Size:L
        <br>
        Color:RED
    </span>
</li>
"""
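# name the parser explicitly; lxml is the one bs4 prefers when it is installed
soup = BeautifulSoup(s, "lxml")

# collect the text of every <span>, trimmed and with newlines removed
print([x.text.strip().replace("\n", "") for x in soup.find_all('span')])

Output:

['Size:L                Color:RED']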

N.B. BeautifulSoup repairs the markup internally, so if you want well-formed markup from your malformed input, try:

print(soup.prettify())

Output:

<html>
 <body>
  <li class="attr">
   <span>
    Size:L
    <br/>
    Color:RED
   </span>
  </li>
 </body>
</html>
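
The soupparser route mentioned at the top of this answer could look roughly like the sketch below. It assumes both lxml and BeautifulSoup are installed; the variable names markup, root and parts are only illustrative.

```python
from lxml.html import soupparser

markup = """<li class="attr">
    <span>
        Size:L
        <br>
        Color:RED
    </span>
</li>
"""

# BeautifulSoup does the lenient parsing; the result is an ordinary lxml tree
root = soupparser.fromstring(markup)

# the usual lxml XPath API works on that tree; drop whitespace-only nodes
text_nodes = root.xpath('//li[@class="attr"]/span/text()')
parts = [t.strip() for t in text_nodes if t.strip()]
print(parts)
```

This keeps the XPath-based code unchanged and only swaps the parser, which may be handy when the rest of a scraper is already written against lxml.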

If your XML is valid, the XPath below will work:

//li[@class='attr']/span/text()[preceding-sibling::br or following-sibling::br]

Live Demo (just click the Test button)
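
For a quick local check of that expression, here is a minimal sketch using lxml.html, whose HTML parser accepts an unclosed <br>. The <ul> wrapper and the variable names are assumptions added for the example, not part of the question.

```python
import lxml.html

# the question's markup, wrapped in a <ul> so the <li> has a parent element
markup = """<ul>
<li class="attr">
    <span>
        Size:L
        <br>
        Color:RED
    </span>
</li>
</ul>"""

doc = lxml.html.fromstring(markup)

# select only the text nodes that sit directly before or after a <br>
expr = ("//li[@class='attr']/span/text()"
        "[preceding-sibling::br or following-sibling::br]")
values = [node.strip() for node in doc.xpath(expr)]
print(values)
```

The predicate keeps a text node only when a <br> is one of its siblings, so a <span> without any <br> would contribute nothing.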

Answer 1: (score: 1)

You can combine Python string methods with the values returned by lxml's XPath:

>>> import lxml.html
>>> text = '''<html>
... <li class="attr">
...     <span>
...         Size:L
...         <br>
...         Color:RED
...     </span>
... </li>
... </html>'''
>>> doc = lxml.html.fromstring(text)
>>>
>>> # text nodes can contain leading and trailing whitespace characters
>>> doc.xpath('.//li[@class="attr"]/span[1]/text()')
['\n        Size:L\n        ', '\n        Color:RED\n    ']
>>> 
>>> # you can use Python's strip() method
>>> [t.strip() for t in doc.xpath('.//li[@class="attr"]/span[1]/text()')]
['Size:L', 'Color:RED']

You can also test that the <span> contains a <br> by using span[br] instead of span[1]:
>>> doc.xpath('.//li[@class="attr"]/span[br]/text()')
['\n        Size:L\n        ', '\n        Color:RED\n    ']
>>> [t.strip() for t in doc.xpath('.//li[@class="attr"]/span[br]/text()')]
['Size:L', 'Color:RED']
>>>
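
To get a single joined string, as the question asks, the stripped pieces can be passed to str.join(), or XPath's normalize-space() can collapse the whitespace in one step. This is a minimal sketch: the single-space separator is an assumption, since the question does not say how the two values should be joined.

```python
import lxml.html

text = '''<html>
<li class="attr">
    <span>
        Size:L
        <br>
        Color:RED
    </span>
</li>
</html>'''
doc = lxml.html.fromstring(text)

# option 1: strip each text node in Python, then join with a separator
parts = [t.strip() for t in doc.xpath('.//li[@class="attr"]/span[1]/text()')]
joined = ' '.join(parts)
print(joined)

# option 2: let XPath take the string value of the first matching <span>
# and collapse every run of whitespace into a single space
normalized = doc.xpath('normalize-space(.//li[@class="attr"]/span[1])')
print(normalized)
```

Option 1 lets you pick any separator (for example ', '), while option 2 always yields single spaces.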