Question

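I've created a selector to scrape a certain string out of some HTML elements. There are two strings within the element. With the selector in the script below I can parse both of them, whereas I want only the latter, in this case "I wanna be scraped alone". How can I use a selector to keep the first string from being parsed?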

This is the HTML element:

html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

I've tried:

from lxml.html import fromstring

root = fromstring(html_elem)
for item in root.cssselect(".expected-content"):
    print(item.text_content())

Output I'm getting:

 I shouldn't be parsed
 I wanna be scraped alone

Expected output:

I wanna be scraped alone

By the way, I also tried root.cssselect(".expected-content:not(.undesirable-content)"), but that is definitely not the right approach. Any help will be highly appreciated.

Answer 1

For the specific example in this question, the best answer is:

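# Select the unwanted <span>; its tail is the text that follows </span>.
for item in root.cssselect(".expected-content .undesirable-content"):
    print(item.tail)

because element.tail returns the text that directly follows an element's closing tag, so the tail of the span is the text between </span> and </a>. However, this won't work if the desired text sits before or between the child nodes. So here is a more robust solution: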
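As a side note on the tail approach, here is a minimal sketch of how lxml splits the text of the example element between .text and .tail. It uses find_class, lxml.html's built-in class lookup, in place of cssselect so that it needs nothing beyond lxml itself; the variable names link and span are only illustrative.

```python
from lxml.html import fromstring

html_elem = """
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

root = fromstring(html_elem)
link = root.find_class("expected-content")[0]
span = link.find_class("undesirable-content")[0]

# link.text is the text before the first child; here it is whitespace only.
print(repr(link.text))
# span.text is the text inside the span itself.
print(repr(span.text))
# span.tail is the text between </span> and the next tag, here </a>.
print(repr(span.tail.strip()))
```

The wanted string is attached to the span as its tail, not to the link, which is why the tail has to be read from the child element.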

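According to the documentation, item.text_content():

Returns the text content of the element, including the text content of its children, with no markup.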

So if you don't want the children's text, remove the children first:

from lxml.html import fromstring

html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

root = fromstring(html_elem)
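for item in root.cssselect(".expected-content"):
    # Iterate over a copy of the children, since drop_tree() modifies the tree.
    for child in list(item):
        # drop_tree() removes the child and its text but keeps its tail text.
        child.drop_tree()
    print(item.text_content())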

Note that this example also returns some surrounding whitespace, which is easy to clean up, for example with str.strip().
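One way to clean up the whitespace, and to avoid modifying the tree at all, is sketched below. It is an alternative to drop_tree() and not part of the answer above: the helper name own_text is hypothetical, and it relies on the XPath text() node test, which selects only the text nodes that are direct children of an element.

```python
from lxml.html import fromstring

html_elem = """
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""


def own_text(element):
    """Return the text directly inside element, ignoring child elements."""
    # "text()" yields only direct text nodes, so the text inside <span>
    # is skipped and the tree is left unmodified.
    pieces = element.xpath("text()")
    # Collapse newlines and indentation into single spaces.
    return " ".join(" ".join(pieces).split())


root = fromstring(html_elem)
for item in root.find_class("expected-content"):
    print(own_text(item))
```

Because it collects every direct text node, this also works when the wanted text sits before or between child elements.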

Unable to create a proper selector to parse a certain string
