Question

I have some sample HTML that I want to parse with kuchiki:

<a href="https://example.com"><em>@</em>Bananowy</a>

I only need Bananowy, without the @.

A similar question for JavaScript: How to get the text node of an element?

Answer 1

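First, let's look at how the parser turns

    <a href="https://example.com"><em>@</em>Bananowy</a>

into a tree: the <a> element has two children, an <em> element (which itself contains the text node "@") and the text node "Bananowy". (The original answer illustrated this tree with a diagram.)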

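Now, if you try the obvious thing and call anchor.text_contents(), you get the concatenated text of every text-node descendant of the anchor tag (<a>), which here is "@Bananowy". This is how text_contents is defined to behave, much like textContent in the DOM.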
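To make the "all descendants" behaviour concrete, here is a minimal sketch using only the standard library. The `Node` enum, `all_text`, and `sample_anchor` are hypothetical stand-ins invented for illustration; they are not kuchiki's types, and the sketch only assumes that `text_contents()` concatenates text depth first as described above.

```rust
// Toy stand-in for a parsed HTML tree; these are NOT kuchiki's types.
enum Node {
    Element(Vec<Node>),
    Text(&'static str),
}

// Concatenate every text node in the subtree, depth first,
// mimicking the behaviour described for `text_contents()`.
fn all_text(node: &Node) -> String {
    match node {
        Node::Text(t) => t.to_string(),
        Node::Element(children) => children.iter().map(all_text).collect(),
    }
}

// Models <a><em>@</em>Bananowy</a>.
fn sample_anchor() -> Node {
    Node::Element(vec![
        Node::Element(vec![Node::Text("@")]), // <em>@</em>
        Node::Text("Bananowy"),
    ])
}

fn main() {
    let anchor = sample_anchor();
    println!("{}", all_text(&anchor)); // prints "@Bananowy"
}
```

Because the recursion descends into <em>, the "@" ends up in the result, which is exactly what the question wants to avoid.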

There are, however, a few ways to get just "Bananowy":

extern crate kuchiki;

use kuchiki::traits::*;

fn main() {
    let html = r"<a href='https://example.com'><em>@</em>Bananowy</a>";

    let document = kuchiki::parse_html().one(html);

    let selector = "a";
    let anchor = document.select_first(selector).unwrap();
    // Quick and dirty hack: assumes the wanted text is the anchor's last child
    let last_child = anchor.as_node().last_child().unwrap();
    println!("{:?}", last_child.into_text_ref().unwrap());

    // Iterating solution: keep only the direct children that are text nodes
    for child in anchor.as_node().children() {
        if let Some(text) = child.as_text() {
            println!("{:?}", text);
        }
    }

    // Iterating solution using the `text_nodes()` iterator adapter
    anchor.as_node().children().text_nodes().for_each(|text| {
        println!("{:?}", text);
    });
}

The first approach is the brittle, hacky one. You simply rely on knowing that "Bananowy" is the last_child of the anchor tag and extract it accordingly: anchor.as_node().last_child().unwrap().into_text_ref().unwrap().

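The second solution iterates over the anchor tag's children (i.e. [Tag(em), TextNode("Bananowy")]) and keeps only the text nodes. It does this with the as_text() method, which returns None for every Node that is not a TextNode. This is much less brittle than the first solution, which breaks if, for example, you have <a><em>@</em>Banan<i>!</i>owy</a>: the last child is then only "owy".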
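The difference between the two approaches on that split-up input can be sketched with a toy model. Everything here (`Node`, `last_child_text`, `direct_text`, `split_anchor_children`) is a hypothetical stand-in written with the standard library only; it does not use kuchiki's API.

```rust
// Toy stand-in for a parsed tree; these are NOT kuchiki's types.
#[allow(dead_code)] // element contents are not inspected in this sketch
enum Node {
    Element(Vec<Node>),
    Text(&'static str),
}

// Approach 1: trust that the last child is the text we want.
fn last_child_text(children: &[Node]) -> Option<&'static str> {
    match children.last() {
        Some(Node::Text(t)) => Some(*t),
        _ => None,
    }
}

// Approach 2: keep every direct child that is a text node.
fn direct_text(children: &[Node]) -> Vec<&'static str> {
    children
        .iter()
        .filter_map(|c| match c {
            Node::Text(t) => Some(*t),
            Node::Element(_) => None,
        })
        .collect()
}

// Children of <a> in <a><em>@</em>Banan<i>!</i>owy</a>.
fn split_anchor_children() -> Vec<Node> {
    vec![
        Node::Element(vec![Node::Text("@")]), // <em>@</em>
        Node::Text("Banan"),
        Node::Element(vec![Node::Text("!")]), // <i>!</i>
        Node::Text("owy"),
    ]
}

fn main() {
    let children = split_anchor_children();
    println!("{:?}", last_child_text(&children)); // Some("owy")
    println!("{:?}", direct_text(&children).concat()); // "Bananowy"
}
```

The last-child shortcut silently drops "Banan", while filtering the direct children recovers both pieces and still skips the text inside <em> and <i>.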

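Edit:

Preferred solution

After looking around, I found a better solution to your problem: the TextNodes iterator.

With that in mind, just write anchor.as_node().children().text_nodes().<<ITERATOR CODE GOES HERE>>; and then map or otherwise manipulate the entries as needed.

Why is this solution better? It is more concise, and it uses a good old Iterator, so it is very similar to the JS answer you linked above.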
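As a rough idea of what "map or manipulate the entries" can look like, the sketch below builds a lazy iterator over text-node children and chains ordinary Iterator adapters onto it. The `text_nodes` function here is a hypothetical standard-library analogue written for this example; it is not kuchiki's `text_nodes()`, whose items are node references rather than plain string slices.

```rust
// Toy stand-in for a parsed tree; these are NOT kuchiki's types.
#[allow(dead_code)] // element contents are not inspected in this sketch
enum Node {
    Element(Vec<Node>),
    Text(&'static str),
}

// Hypothetical analogue of a text-node iterator:
// lazily yields the text of each direct child that is a text node.
fn text_nodes<'a>(children: &'a [Node]) -> impl Iterator<Item = &'static str> + 'a {
    children.iter().filter_map(|c| match c {
        Node::Text(t) => Some(*t),
        Node::Element(_) => None,
    })
}

// Children of <a> in <a><em>@</em>Bananowy</a>.
fn anchor_children() -> Vec<Node> {
    vec![
        Node::Element(vec![Node::Text("@")]), // <em>@</em>
        Node::Text("Bananowy"),
    ]
}

fn main() {
    let children = anchor_children();

    // Collect the entries as they are.
    let texts: Vec<&str> = text_nodes(&children).collect();
    println!("{:?}", texts); // ["Bananowy"]

    // Or transform each entry before collecting.
    let shouted: String = text_nodes(&children).map(str::to_uppercase).collect();
    println!("{}", shouted); // BANANOWY
}
```

Returning an iterator instead of a Vec leaves the choice of map, filter, or collect to the caller, which is the convenience the answer credits to the iterator-based solution.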

How to get only the TEXT_NODE using kuchiki
