Question

From the HTML element below, how can I keep the text Hi there!! and discard the other text, Cat, using a CSS selector? Also, .text and .text.strip() give me no result, but .text_content() does return the text.

from lxml.html import fromstring

html="""
<div id="item_type" data-attribute="item_type" class="ms-crm-Inline" aria-describe="item_type_c">
    <div>
        <label for="item_type_outer" id="Type_outer">
            <div class="NotVisible">Cat</div>
        Hi there!!
            <div class="GradientMask"></div>
        </label>
    </div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("#Type_outer"):
    print(item.text)  # doesn't work
    print(item.text.strip()) # doesn't work
    print(item.text_content()) # working one

Result:

Cat 
Hi there!!

However, the only result I want is Hi there!!. This is what I tried:

root.cssselect("#Type_outer:not(.NotVisible)") #it doesn't work either

So, once again, my questions are:

Why does .text_content() work, while .text and .text.strip() do not?
How can I get only Hi there!! using a CSS selector?

Answer 1

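In the lxml tree model, the text you want is stored in the tail of the div with class "NotVisible":

>>> root = fromstring(html)
>>> for item in root.cssselect("#Type_outer > div.NotVisible"):
...     print(item.tail.strip())
...
Hi there!!

So, to answer the first question: only a text node that is not preceded by an element ends up in the text property of its parent. A text node that has a preceding sibling element, like the one in this question, ends up in the tail property of that sibling. That is why .text on the label contains only whitespace, while .text_content() collects all the text inside the element, including the text and tails of its descendants, and therefore returns both Cat and Hi there!!.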
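To make the text/tail split concrete, here is a minimal sketch that prints where each piece of text lives. It assumes a trimmed copy of the markup from the question, and it looks the label up with get_element_by_id rather than cssselect so that it does not depend on the separate cssselect package. The variable names (label, hidden, wanted) are only illustrative.

```python
from lxml.html import fromstring

# Trimmed-down copy of the markup from the question (assumption: the
# extra attributes and wrapper div do not matter for this illustration).
html = """
<div id="item_type">
    <label for="item_type_outer" id="Type_outer">
        <div class="NotVisible">Cat</div>
    Hi there!!
        <div class="GradientMask"></div>
    </label>
</div>
"""

root = fromstring(html)
label = root.get_element_by_id("Type_outer")

# label.text only holds what comes before the first child element,
# which here is nothing but whitespace.
print(repr(label.text))

# Text that follows a child element is stored on that child as .tail.
for child in label:
    print(child.get("class"), repr(child.text), repr(child.tail))

# The wanted text is the tail of the first child, div.NotVisible.
hidden = label[0]
wanted = hidden.tail.strip()
print(wanted)
```

Printing with repr makes the whitespace-only values visible, which is what makes .text look as if it "doesn't work" in the question.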

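Another way to get the text "Hi there!!" is to query for the non-empty text nodes that are direct children of the label. That level of detail can be expressed with an XPath expression, for example text()[normalize-space()] evaluated on the label element.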
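The XPath approach described above might look like the following sketch. The answer's own snippet is not preserved here, so this is a reconstruction under assumptions rather than the author's exact code: the markup is a trimmed copy of the question's, the label is located with a plain XPath id test instead of cssselect, and the names nodes and result are illustrative.

```python
from lxml.html import fromstring

# Trimmed-down copy of the markup from the question (assumption).
html = """
<div id="item_type">
    <label for="item_type_outer" id="Type_outer">
        <div class="NotVisible">Cat</div>
    Hi there!!
        <div class="GradientMask"></div>
    </label>
</div>
"""

root = fromstring(html)

# text() selects the direct text-node children of the label;
# [normalize-space()] keeps only those that are not pure whitespace.
nodes = root.xpath('.//*[@id="Type_outer"]/text()[normalize-space()]')

# Each match is a string-like object, so it can be stripped directly.
result = [node.strip() for node in nodes]
print(result)
```

Because "Cat" sits inside the nested div rather than directly under the label, it is not a direct text-node child and is left out without any extra filtering.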
