I'm trying to join the strings that are split by a <br> inside a <span>, but nothing I've tried works.
Here is the code:
<li class="attr">
<span>
Size:L
<br>
Color:RED
</span>
</li>
I tried this, but it doesn't work:
color_and_size = row.xpath('.//li[@class="attr"][1]/span[1]/text()')[0]
Answer 0 (score: 1)
Your markup does not appear to be well-formed XML, because the <br> tag is never closed. If you are using lxml, try its BeautifulSoup-based soupparser, or use BeautifulSoup on its own, as below:
from bs4 import BeautifulSoup
s = """<li class="attr">
<span>
Size:L
<br>
Color:RED
</span>
</li>
"""
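# Name the parser explicitly; "lxml" produces the <html>/<body> wrapper shown further down.
soup = BeautifulSoup(s, "lxml")
# Collapse every run of whitespace (including newlines) to a single space.
print([" ".join(span.text.split()) for span in soup.find_all("span")])
Prints:
['Size:L Color:RED']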
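For readers who prefer to stay within lxml, the soupparser route mentioned above might look like the following. This is a minimal sketch, assuming both lxml and beautifulsoup4 are installed; the names `root` and `parts` are illustrative and not from the original answer.

```python
from lxml.html import soupparser

s = """<li class="attr">
<span>
Size:L
<br>
Color:RED
</span>
</li>
"""

# soupparser lets BeautifulSoup repair the markup, then returns an lxml tree
root = soupparser.fromstring(s)

# text() yields the text nodes on either side of the <br>;
# strip each one and drop any that are whitespace only
nodes = root.xpath('//li[@class="attr"]/span/text()')
parts = [t.strip() for t in nodes if t.strip()]

print(" ".join(parts))
```

The result is an ordinary lxml element, so the usual XPath calls work on it even though the input was not well-formed.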
N.B. BeautifulSoup repairs the markup internally, so if you want well-formed markup from the malformed input, try:
print(soup.prettify())
Prints:
<html>
<body>
<li class="attr">
<span>
Size:L
<br/>
Color:RED
</span>
</li>
</body>
</html>
If your XML is well-formed, the XPath below will work:
//li[@class='attr']/span/text()[preceding-sibling::br or following-sibling::br]
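To see that XPath in action, here is a minimal sketch using lxml.etree. It assumes the markup has already been made well-formed (note the self-closing <br/>); the wrapping <ul> element and the variable names are illustrative additions, not part of the original answer.

```python
from lxml import etree

# Well-formed version of the snippet: <br/> is self-closing
xml = """<ul>
<li class="attr">
<span>
Size:L
<br/>
Color:RED
</span>
</li>
</ul>"""

root = etree.fromstring(xml)

# Select only the text nodes that sit directly before or after a <br/>
nodes = root.xpath(
    "//li[@class='attr']/span/text()"
    "[preceding-sibling::br or following-sibling::br]"
)

# Strip the surrounding newlines, then join with a single space
parts = [n.strip() for n in nodes]
joined = " ".join(parts)
print(joined)
```

The predicate keeps a text node only if it has a <br/> sibling, so a <span> without any <br/> contributes nothing to the result.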
Answer 1 (score: 1)
You can combine Python string methods with the values returned by lxml's XPath:
>>> import lxml.html
>>> text = '''<html>
... <li class="attr">
... <span>
... Size:L
... <br>
... Color:RED
... </span>
... </li>
... </html>'''
>>> doc = lxml.html.fromstring(text)
>>>
>>> # text nodes can contain leading and trailing whitespace characters
>>> doc.xpath('.//li[@class="attr"]/span[1]/text()')
['\n Size:L\n ', '\n Color:RED\n ']
>>>
>>> # you can use Python's strip() method
>>> [t.strip() for t in doc.xpath('.//li[@class="attr"]/span[1]/text()')]
['Size:L', 'Color:RED']
If you want the <span> that contains a <br>, use span[br] instead of span[1]:
>>> doc.xpath('.//li[@class="attr"]/span[br]/text()')
['\n Size:L\n ', '\n Color:RED\n ']
>>> [t.strip() for t in doc.xpath('.//li[@class="attr"]/span[br]/text()')]
['Size:L', 'Color:RED']
>>>
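Since the question asks for a single joined string rather than a list, one possible last step is sketched below. It shows two options under the same assumptions as the session above: joining the stripped text nodes in Python, or letting XPath's normalize-space() collapse the whitespace. The variable names `joined` and `normalized` are illustrative.

```python
import lxml.html

text = '''<html>
<li class="attr">
<span>
 Size:L
 <br>
 Color:RED
</span>
</li>
</html>'''
doc = lxml.html.fromstring(text)

# Option 1: strip each text node in Python, then join with a space
parts = [t.strip() for t in doc.xpath('.//li[@class="attr"]/span[1]/text()')]
joined = " ".join(parts)

# Option 2: normalize-space() takes the string value of the first matched
# node, trims it, and collapses inner whitespace runs to single spaces
normalized = doc.xpath('normalize-space(.//li[@class="attr"]/span[1])')

print(joined)
print(normalized)
```

Option 1 keeps the individual parts available if they are needed separately; option 2 is shorter when only the combined text matters.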