Question

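I am matching specific strings in an element's text and want to wrap the matched text in a span so that I can select it and apply modifications later, but the HTML entities are being escaped. Is there a way to wrap the string in HTML tags without them being escaped?

I tried the unescapeEntities() method, but it does not work in this case, and wrap() did not work either. For reference on these methods, see https://jsoup.org/apidocs/org/jsoup/parser/Parser.html

Current code: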

for (Element div : doc.select("div")) {
    for (String input : listOfStrings) {
        if (div.ownText().contains(input)) {
            div.text(div.ownText().replaceFirst(input, "<span class=\"select-me\">" + input + "</span>"));
        }
    }
}

Desired output

<div>some text <span class="select-me">matched string</span></div>

Actual output

<div>some text &lt;span class="select-me"&gt;matched string&lt;/span&gt;</div>

Answer 1

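Based on your question and comments, it seems you only want to modify the direct text nodes of the selected element, without touching the text nodes of any elements nested inside it, so for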

<div>a b <span>b c</span></div>

if we want to modify b, we modify only the one placed directly in <div>, not the one inside <span>.

<div>a b <span>b c</span></div> 
       ^       ^----don't modify because it is in <span>, not *directly* in <div>
       |
     modify

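Text is not an element the way <div>, <span>, and so on are, but the DOM still represents it as a node, namely a TextNode. So if we have a structure like <div> a <span>b</span> c </div>, its DOM representation is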

Element: <div>
├ Text: " a "
├ Element: <span>
│ └ Text: "b"
└ Text: " c "
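
To inspect this structure yourself, a minimal sketch like the one below walks the direct children of the <div> and labels each one. It assumes jsoup is on the classpath; the class name ChildNodesDemo and the describe helper are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ChildNodesDemo {

    // Hypothetical helper: lists the direct child nodes of the first <div>.
    static String describe(String html) {
        Element div = Jsoup.parse(html).selectFirst("div");
        StringBuilder sb = new StringBuilder();
        for (Node child : div.childNodes()) {
            if (child instanceof TextNode) {
                // getWholeText() returns the text without whitespace normalisation
                sb.append("Text: \"").append(((TextNode) child).getWholeText()).append("\"\n");
            } else if (child instanceof Element) {
                sb.append("Element: <").append(((Element) child).tagName()).append(">\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("<div> a <span>b</span> c </div>"));
    }
}
```

Only the direct children are listed, which is why the "b" inside <span> does not show up as a text node of the <div>.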

If we want to wrap part of some text in <span> (or any other tag), we are effectively splitting a single TextNode

├ Text: "foo bar baz"

into the following series of nodes:

├ Text: "foo "
├ Element: <span>
│ └ Text: "bar"
└ Text: " baz"
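
The split pictured above can be sketched with two jsoup calls, TextNode.splitText(int) and Node.wrap(String). This is only an illustration under simple assumptions: jsoup is on the classpath, the <div> begins with a text node, and only the first occurrence of the target is handled. SplitAndWrapDemo and wrapFirst are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class SplitAndWrapDemo {

    // Hypothetical helper: wraps the first occurrence of target found in the
    // first text node of the first <div>, and returns that <div>.
    static Element wrapFirst(String html, String target, String wrapperHtml) {
        Element div = Jsoup.parse(html).selectFirst("div");
        TextNode left = div.textNodes().get(0);
        int start = left.getWholeText().indexOf(target);
        if (start < 0) {
            return div; // nothing to wrap
        }
        // "foo bar baz" -> left keeps "foo ", middle holds "bar baz"
        TextNode middle = left.splitText(start);
        // split off the tail only if there is text after the target
        if (middle.getWholeText().length() > target.length()) {
            middle.splitText(target.length()); // middle now holds only "bar"
        }
        middle.wrap(wrapperHtml);
        return div;
    }

    public static void main(String[] args) {
        Element div = wrapFirst("<div>foo bar baz</div>", "bar", "<span class='x'></span>");
        System.out.println(div.childNodeSize());
        System.out.println(div.select("span.x").text());
    }
}
```

Because the wrapper is added as a real element rather than as text, jsoup has nothing to escape.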

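To build a solution on this idea, the TextNode API gives us a fairly limited set of tools, but among the available methods we can use:

splitText(index), which modifies the original TextNode so that it keeps the "left" side of the split, and returns a new TextNode holding the remaining (right) side. For example, if TextNode node1 holds "foo bar", then after TextNode node2 = node1.splitText(3); node1 will hold "foo", node2 will hold " bar", and node2 will be placed as the immediate next sibling of node1.
wrap(html) (inherited from the Node superclass), which wraps the TextNode in the element described by html. For example, node.wrap("<span class='myClass'>") results in <span class='myClass'>text from node</span>.

Using these tools, we can create a method like this: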

static void wrapTextWithElement(TextNode textNode, String strToWrap, String wrapperHTML) {

    while (textNode.text().contains(strToWrap)) {
        // separates part before strToWrap
        // and returns node starting with text we want
        TextNode rightNodeFromSplit = textNode.splitText(textNode.text().indexOf(strToWrap));

        // if there is more text after searched string we need to
        // separate it and handle in next iteration
        if (rightNodeFromSplit.text().length() > strToWrap.length()) {
            textNode = rightNodeFromSplit.splitText(strToWrap.length());
            // after separating the remaining part, rightNodeFromSplit holds
            // only the part we were looking for, so let's wrap it
            rightNodeFromSplit.wrap(wrapperHTML);
        } else { // here we know that node is holding only text to wrap
            rightNodeFromSplit.wrap(wrapperHTML);
            return;// since textNode didn't change but we already handled everything
        }
    }
}

Example usage:

Document doc = Jsoup.parse("<div>b a b <span>b c</span> d b</div> ");
System.out.println("BEFORE CHANGES:");
System.out.println(doc);

Element id1 = doc.select("div").first();
for (TextNode textNode : id1.textNodes()) {
    wrapTextWithElement(textNode, "b", "<span class='x'>");
}

System.out.println();
System.out.println("AFTER CHANGES");
System.out.println(doc);

Answer 2

Detailed explanation in the comments:

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class StackOverflow56717248 {

    public static void main(String[] args) {
        List<String> listOfStrings = new ArrayList<>();
        listOfStrings.add("INPUT");
        Document doc = Jsoup.parse(
                "<div id=\"1\">some text 1</div>" +
                "<div id=\"2\"> node before <b>xxx</b> this one contains INPUT text <b>xxx</b> node after</div>");
        System.out.println("BEFORE: ");
        System.out.println(doc);
        // iterating over all the divs
        for (Element div : doc.select("div")) {
            // and input texts
            for (String input : listOfStrings) {
                // to find the one with desired text
                if (div.ownText().contains(input)) {
                    // when found we have to be aware that this node may not be the only child
                    // so we have to iterate over children nodes
                    for (int i = 0; i < div.childNodeSize(); i++) {
                        Node child = div.childNode(i);
                        // taking into account only TextNodes
                        if (child instanceof TextNode && ((TextNode) child).text().contains(input)) {
                            TextNode textNode = ((TextNode) child);
                            // when found the one matching we can split text node
                            // into two nodes breaking it on position of desired text
                            // which will be inserted as a next sibling node
                            int indexOfInputText = textNode.text().indexOf(input);
                            textNode.splitText(indexOfInputText);
                            // getting the next node (the one newly created!)
                            TextNode nodeWithInput = (TextNode) textNode.nextSibling();
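                            // split again only if more text follows the input text;
                            // splitText requires an offset smaller than the text length
                            if (nodeWithInput.getWholeText().length() > input.length()) {
                                nodeWithInput.splitText(input.length());
                            }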
                            // now this node contains only input text so we can wrap it with whatever you want
                            nodeWithInput.wrap("<span class=\"select-me\"></span>");
                            break;
                        }
                    }
                }
            }
        }
        System.out.println("--------");
        System.out.println("RESULT:");
        System.out.println(doc);
    }

}

How to wrap part of the text with <span> or any other HTML tag without escaping the new HTML structure?
