<div id="content">
foo <br/>
bar <br/>
</div>
I am trying to get the inner text of the content div above using the following:
response.xpath('//div[@id ="content"]').extract()
This gives me the following:
[u'<div id="content"> foo<br/>bar <br/></div>']
How can I get:
foo<br/>bar<br/>
Answer 0: (score: 0)
lxml is quite inconvenient in many places, and getting an element's inner HTML is one of them. Adapted from an answer by lormus:
from lxml import html

def inner_html(element):
    # Text before the first child, then each child serialized with its trailing text.
    return (
        (element.text or '') +
        ''.join(html.tostring(child, encoding='unicode') for child in element)
    )
In use:
>>> from scrapy.selector import Selector
>>> response = Selector(text="""
... <div id="content">
... foo <br/>
... bar <br/>
... </div>
... """)
>>> inner_html(response.css('#content')[0].root)
'\n foo <br>\n bar <br>\n'
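The concatenation above works because lxml keeps the text before the first child in `element.text` and the text following each child in that child's `.tail`, and `html.tostring` includes the tail by default. Here is a minimal sketch of that model using lxml directly, without Scrapy. The sample markup is the fragment from the question, and the variable names are only illustrative:

```python
from lxml import html

# Parse the fragment from the question; html.fromstring returns the <div> itself.
div = html.fromstring('<div id="content">\nfoo <br/>\nbar <br/>\n</div>')
first_br = div[0]

leading_text = div.text      # text before the first child
tail_text = first_br.tail    # text between the first and second <br>

# tostring() appends the tail by default; with_tail=False serializes the tag alone.
with_tail = html.tostring(first_br, encoding='unicode')
without_tail = html.tostring(first_br, encoding='unicode', with_tail=False)
```

Joining `div.text` with every child serialized this way therefore reproduces all of the inner content exactly once.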
Answer 1: (score: 0)
Try this:
from operator import methodcaller
''.join(map(methodcaller('strip'), response.xpath('//div[@id ="content"]/node()').extract()))
# output: u'foo<br>bar<br>'
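To see what `node()` selects here, the same idea can be sketched with lxml alone: the child nodes of the div are a mix of text nodes and element nodes, and each piece is stripped before joining. This assumes Python 3 (text nodes arrive as `str` subclasses), and `stripped_inner_html` is a hypothetical helper name, not part of Scrapy or lxml:

```python
from lxml import html

def stripped_inner_html(element):
    # Hypothetical helper: walk the child nodes selected by node(), strip
    # text nodes, and serialize element nodes without their tail text.
    pieces = []
    for node in element.xpath('node()'):
        if isinstance(node, str):
            pieces.append(node.strip())
        else:
            pieces.append(html.tostring(node, encoding='unicode', with_tail=False))
    return ''.join(pieces)

div = html.fromstring('<div id="content">\nfoo <br/>\nbar <br/>\n</div>')
result = stripped_inner_html(div)  # 'foo<br>bar<br>'
```

Note that stripping every text node also removes whitespace that may be significant between inline elements, so this suits cases like the one in the question rather than arbitrary HTML.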
Please note that lxml changes the <br /> to <br> when it serializes the nodes.
If you don't need those inner tags, you could do this instead:
response.xpath('normalize-space(//div[@id ="content"])').extract_first()
# output: u'foo bar'
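For comparison, here is a rough sketch of the same text-only extraction done with lxml directly. `normalize-space()` trims the ends and collapses runs of whitespace; the pure-Python line is only approximately equivalent, since `str.split()` treats more characters as whitespace than XPath does:

```python
from lxml import html

div = html.fromstring('<div id="content">\nfoo <br/>\nbar <br/>\n</div>')

# XPath normalize-space(): trims the ends and collapses whitespace runs.
via_xpath = div.xpath('normalize-space(.)')

# Approximate pure-Python equivalent on the element's concatenated text.
via_python = ' '.join(div.text_content().split())
```

Both values come out as 'foo bar' for this fragment.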