Python parsing: lxml, text that is only partly inside markup

时间:2010-07-21 17:54:33

标签: python screen-scraping lxml

I'm working with HTML in Python that looks like this. I'm parsing it with lxml, but would be just as happy to use pyquery:

<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>

Pulling out 'Name' and 'Address' is easy whichever library I use, but how do I get the remaining text, i.e. 'Dave Davies'?

3 answers:

Answer 0 (score: 2):

Another approach, using XPath:

>>> from lxml import html
>>> doc = html.parse( file )
>>> doc.xpath( '//span[@class="Title"][text()="Name"]/../self::p/text()' )
['Dave Davies']
>>> doc.xpath( '//span[@class="Title"][text()="Address"]/../self::p/text()' )
['123 Greyfriars Road, London']
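
If the markup lives in a string rather than a file, the same XPath can be wrapped in a small helper. This is only a sketch; the field function name is made up for illustration:

from lxml import html

snippet = '''<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

doc = html.fromstring(snippet)

def field(tree, label):
    # text() on the parent <p> returns the text nodes that sit outside the <span>
    parts = tree.xpath('//span[@class="Title"][text()="%s"]/../text()' % label)
    return ''.join(parts).strip()

print(field(doc, 'Name'))     # Dave Davies
print(field(doc, 'Address'))  # 123 Greyfriars Road, London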

Answer 1 (score: 1):

Every element can have a text and a tail attribute (in the linked docs, search for the word "tail"):

import lxml.etree

content='''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''


# parse the fragment with lxml's HTML parser
root = lxml.etree.fromstring(content, parser=lxml.etree.HTMLParser())
for elt in root.findall('.//span'):
    # .text is the text inside the <span>; .tail is the text that follows it
    print(elt.text, elt.tail)

# ('Name', 'Dave Davies')
# ('Address', '123 Greyfriars Road, London')
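
In other words, each span's text is the label and its tail is the value after the closing tag, so the pairs can also be collected in one pass. A small sketch building on the root tree above (the fields name is just illustrative):

fields = {elt.text: (elt.tail or '').strip() for elt in root.iter('span')}
# {'Name': 'Dave Davies', 'Address': '123 Greyfriars Road, London'}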

Answer 2 (score: 0):

Have a look at BeautifulSoup. I've only just started using it, so I'm no expert. Off the top of my head:

import BeautifulSoup

text = '''<p><span class="Title">Name</span>Dave Davies</p>
          <p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

soup = BeautifulSoup.BeautifulSoup(text)

paras = soup.findAll('p')

for para in paras:
    spantext = para.span.text            # the label inside the <span>
    othertext = para.span.nextSibling    # the text that follows the </span>
    print spantext, othertext

[Out]: Name Dave Davies
       Address 123 Greyfriars Road, London
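
For what it's worth, the same idea in the newer BeautifulSoup 4 looks roughly like this (only a sketch, assuming the same text string as above; bs4 uses find_all and next_sibling in place of findAll and nextSibling):

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
for para in soup.find_all('p'):
    # the <span> holds the label; its next sibling is the remaining text
    print(para.span.text, para.span.next_sibling)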