I have some HTML files:
<html>
<body>
<span class="text">One</span>some text1</br>
<span class="cyrillic">Мир</span>some text2</br>
</body>
</html>
How can I get "some text1" and "some text2" using lxml with Python?
Answer 0 (score: 5)
import lxml.html
doc = lxml.html.document_fromstring("""<html>
<body>
<span class="text">One</span>some text1</br>
<span class="cyrillic">Мир</span>some text2</br>
</body>
</html>
""")
# xpath() returns a list; following-sibling::text()[1] selects the first text node after each span
txt1 = doc.xpath('/html/body/span[@class="text"]/following-sibling::text()[1]')
txt2 = doc.xpath('/html/body/span[@class="cyrillic"]/following-sibling::text()[1]')
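
Note that txt1 and txt2 above are lists, not strings, and the text nodes may carry trailing whitespace depending on how the parser treats the stray </br> tags. Here is a minimal sketch of one way to turn them into plain strings, assuming Python 3 and that every span of interest has a class attribute. The helper name text_after_spans is hypothetical, not part of lxml.

```python
import lxml.html

HTML = """<html>
<body>
<span class="text">One</span>some text1</br>
<span class="cyrillic">Мир</span>some text2</br>
</body>
</html>
"""


def text_after_spans(html):
    """Map each span's class to the text that directly follows it.

    Hypothetical helper for illustration only.
    """
    doc = lxml.html.document_fromstring(html)
    result = {}
    for span in doc.xpath('//span[@class]'):
        # Relative XPath: the first text node among the span's following siblings.
        following = span.xpath('following-sibling::text()[1]')
        # The list is empty if no text follows; strip() removes stray newlines.
        result[span.get('class')] = following[0].strip() if following else ''
    return result


texts = text_after_spans(HTML)
print(texts)
```

Looping over the spans avoids hard-coding one XPath expression per class, which may be handy if the real files contain more than two spans.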
Answer 1 (score: 3)
I use lxml for XML parsing, but I use BeautifulSoup for HTML. Here is a very quick, short tour that ends with one solution to your question. Hope it helps.
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup as soup
>>> stream = open('bs.html', 'r')
>>> doc = soup(stream.read())
>>> doc.body.span
<span class="text">One</span>
>>> doc.body.span.nextSibling
u'some text1'
>>> x = doc.findAll('span')
>>> for i in x:
...     print unicode(i)
...
<span class="text">One</span>
<span class="cyrillic">Мир</span>
>>> x = doc('span')
>>> type(x)
<class 'BeautifulSoup.ResultSet'>
>>> for i in x:
...     print unicode(i)
...
<span class="text">One</span>
<span class="cyrillic">Мир</span>
>>> for i in x:
...     print i.nextSibling
...
some text1
some text2
>>>
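
The session above uses Python 2 and the old BeautifulSoup 3 package. For anyone on Python 3, here is a rough equivalent as a sketch, assuming the bs4 package (Beautiful Soup 4) is installed and using the standard library's html.parser backend. In bs4 the names are find_all and next_sibling instead of findAll and nextSibling.

```python
from bs4 import BeautifulSoup

HTML = """<html>
<body>
<span class="text">One</span>some text1</br>
<span class="cyrillic">Мир</span>some text2</br>
</body>
</html>
"""

# Parse with the built-in parser so no extra backend is required.
soup = BeautifulSoup(HTML, "html.parser")

# next_sibling is the node right after each span, here a text node.
# str(...).strip() converts it to a plain string without surrounding whitespace.
trailing = [str(span.next_sibling).strip() for span in soup.find_all("span")]

for text in trailing:
    print(text)
```

If a span could be the last node in its parent, next_sibling would be None, so real code may want to check for that before converting to a string.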