Question

I have the following markup:

<div class="example">
    <p> text <a href="#"> link </a> text</p>
</div>

I want to get:

<p> text <a href="#"> link </a> text</p>

So, everything inside the div with the class example. I am using

import requests
from lxml import html
page = requests.get('X')
tree = html.fromstring(page.content)

description = tree.xpath('//div[@class="example"]/p//text()')

It gives me a list of the paragraphs' text nodes, which I then join together with

description = ' '.join('<p>{0}</p>'.format(paragraph) for paragraph in description)

But there must be a way to get the contents of the div directly? Thanks, Carl

Answer 1

I found a solution... not very pretty, but it gives me what I want...

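dummy = tree.xpath('//div[@class="example"]/div[2]/div/node()')
description = ''
for paragraph in dummy:
    try:
        # encoding='unicode' makes tostring() return str instead of bytes,
        # so the result can be appended to description on Python 3
        description += html.tostring(paragraph, encoding='unicode')
    except TypeError:
        # node() also yields bare text nodes (plain strings),
        # which tostring() cannot serialize; skip them
        pass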

Answer 2

You just need to get all the nodes inside the tag:

h = """<div class="example">
<p> text <a href="#"> link </a> text</p>
<p> othertext <a href="#"> otherlink </a> text</p>
</div>"""

from lxml import html

x = html.fromstring(h)

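# encoding="unicode" makes tostring() return str, so "".join() works on Python 3
print("".join(html.tostring(n, encoding="unicode") for n in x.xpath("//div[@class='example']/*")))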

Output:

<p> text <a href="#"> link </a> text</p>
<p> othertext <a href="#"> otherlink </a> text</p>

Or using .iterchildren:

"".join(html.tostring(n, encoding="unicode") for n in x.xpath("//div[@class='example']")[0].iterchildren())

No need for any try/except.

Web scraping... get all content from a tag that contains other tags
