Parsing with lxml cssselect

Time: 2011-02-05 21:32:55

Tags: python html parsing css-selectors lxml

I have a document containing the following data:

<div class="ds-list">
    <b>1. </b> 
    A domesticated carnivorous mammal 
    <i>(Canis familiaris)</i> 
    related to the foxes and wolves and raised in a wide variety of breeds.
</div>

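I would like to get everything inside the element with class ds-list (without the <b> and <i> tags). Currently my code is doc.cssselect('div.ds-list'), but all it picks up is the newline before <b>. How can I get it to do what I want?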

2 answers:

Answer 0 (score: 8)

Perhaps you are looking for the text_content method?

import lxml.html as lh
content='''\
<div class="ds-list">
    <b>1. </b> 
    A domesticated carnivorous mammal 
    <i>(Canis familiaris)</i> 
    related to the foxes and wolves and raised in a wide variety of breeds.
</div>'''
doc=lh.fromstring(content)
for div in doc.cssselect('div.ds-list'):
    print(div.text_content())

yields

1.  
A domesticated carnivorous mammal 
(Canis familiaris) 
related to the foxes and wolves and raised in a wide variety of breeds.

Answer 1 (score: 1)

# cssselect() returns a list of elements, so index into it before calling text_content()
doc.cssselect("div.ds-list")[0].text_content()