Question

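I am unable to parse some text for the following scenarios using the Jsoup Java library.

1: This is My Text some other text as well non empty tag1 other text.

Expected output: some other text as well 

2: This is My Text some other text as well non empty tag2 other text.

Expected output: some other text as well 

3: This is My Text some other text as well non empty tag2 other text non empty tag3.

Expected output: some other text as well 

Note that the text "My Text" is fixed (static), but the value of the second non-empty B tag may vary (whitespace alone does not count as a value). The regular expression should extract the text between "My Text" and the first occurrence of a non-empty tag.

I am using the Jsoup library but cannot get the expected output above. Please make sure the solution is generic across all scenarios, because the input is dynamic in my case.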

Answer 1

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main {
      public static void main(String[] args) {
        String html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text";
        System.out.println(getTargetText(html));
        html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text";
        System.out.println(getTargetText(html));
        html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>";
        System.out.println(getTargetText(html));
      }

      // Returns the raw HTML between <b>My Text</b> and the first non-empty <b> after it, or null.
      public static String getTargetText(String html) {
        Document doc = Jsoup.parse(html);
        Elements bTags = doc.getElementsByTag("b");
        Element startBTag = null;
        Element endBTag = null;

        for (int i = 0; i < bTags.size(); i++) {
          Element bTag = bTags.get(i);
          String text = bTag.text().trim(); // use html() instead of text() if you want to match nested inner tags.
          if (startBTag == null && text.equals("My Text")) {
            startBTag = bTag;
          } else if (startBTag != null && !text.isEmpty()) {
            // first non-empty <b> after the start tag; add a stricter check (e.g. a regex match) here if needed
            endBTag = bTag;
            break;
          }
        }

        if (endBTag != null) {
          // Note: this relies on outerHtml() reproducing the tag exactly as written in the source string.
          String startString = startBTag.outerHtml();
          String endString = endBTag.outerHtml();
          int startIndex = html.indexOf(startString);
          if (startIndex >= 0) {
            int endIndex = html.indexOf(endString, startIndex + startString.length());
            if (endIndex >= 0) {
              return html.substring(startIndex + startString.length(), endIndex);
            }
          }
        }
        return null;
      }
    }

Output:

     some other <b> </b> text as well <b></b>
     some other <b> </b> text as well <b></b>
     some other <b> </b> text as well <b></b>

Answer 2

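A simple solution could look like this:

1. Find the element you are interested in (the one containing the text you are looking for).
2. Iterate over the siblings that follow it and print them until you find a non-empty one.

You only need to remember that Jsoup uses Node to store everything (including text that is not part of a tag), while the Element class (which extends Node) represents only a specific tag.

For example, text such as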

before <b>bold</b> after<i>italic</i>

will be represented as

<node>before </node>
<element tag="B">
   <node>bold</node>
</element>
<node> after</node>
<element tag="I">
   <node>italic</node>
</element>

So if you select("b") (which finds <element tag="B">) and call nextElementSibling(), it moves you to <element tag="I">. To get <node> after</node> you need to use nextSibling(), which does not skip plain text nodes.
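To make the difference concrete, here is a minimal sketch using the same sample string. The class and method names (SiblingDemo, nextElementName, nextNodeName) are made up for illustration; only the Jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical demo class: compares nextElementSibling() with nextSibling().
public class SiblingDemo {

    // Node name of the next *element* sibling of the first <b>; text nodes are skipped.
    public static String nextElementName(String html) {
        Element b = Jsoup.parse(html).selectFirst("b");
        Element next = b.nextElementSibling();
        return next == null ? null : next.nodeName();
    }

    // Node name of the next sibling of any kind; a text node reports "#text".
    public static String nextNodeName(String html) {
        Element b = Jsoup.parse(html).selectFirst("b");
        Node next = b.nextSibling();
        return next == null ? null : next.nodeName();
    }

    public static void main(String[] args) {
        String html = "before <b>bold</b> after<i>italic</i>";
        System.out.println(nextElementName(html)); // prints "i"
        System.out.println(nextNodeName(html));    // prints "#text"
    }
}
```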

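One possible problem with the Node class is that it does not provide a text() method that returns the text content of the current node (which would let us test whether the current node/element has any text). But nothing stops us from casting a Node that represents a tag to Element, which does provide such a method.

So our solution could look like this: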

import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class FragmentFinder {

    public static String findFragment(String html, String fixedStart) {

        Document doc = Jsoup.parse(html);
        // <b> whose text matches fixedStart exactly
        Element myBTag = doc
                .select("b:matches(^" + Pattern.quote(fixedStart) + "$)")
                .first();
        if (myBTag == null) {
            return null; // no <b> with the fixed text was found
        }

        StringBuilder sb = new StringBuilder();

        Node currentSibling = myBTag.nextSibling();
        while (currentSibling != null) {
            if (currentSibling.nodeName().equals("b")) {
                Element b = (Element) currentSibling;
                if (!b.text().trim().isEmpty())
                    break; // first non-empty <b>: stop without appending it
            }
            sb.append(currentSibling.toString());
            currentSibling = currentSibling.nextSibling();
        }

        return sb.toString();
    }
}
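
If you want only the plain text, as in the expected output in the question, rather than the HTML fragment, the same sibling walk can collect text instead. This is just a sketch under my own assumptions: the class and method names (TextBetween, textAfter) are hypothetical, empty b tags contribute nothing, and any other tag contributes its text().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper: plain text between <b>fixedStart</b> and the next non-empty <b>.
public class TextBetween {

    public static String textAfter(String html, String fixedStart) {
        Document doc = Jsoup.parse(html);

        // Locate the first <b> whose trimmed text equals the fixed start text.
        Element start = null;
        for (Element b : doc.select("b")) {
            if (b.text().trim().equals(fixedStart)) {
                start = b;
                break;
            }
        }
        if (start == null) {
            return null; // fixed start tag not present
        }

        StringBuilder sb = new StringBuilder();
        for (Node n = start.nextSibling(); n != null; n = n.nextSibling()) {
            if (n instanceof TextNode) {
                sb.append(((TextNode) n).text());
            } else if (n instanceof Element) {
                Element e = (Element) n;
                if (e.tagName().equals("b") && !e.text().trim().isEmpty()) {
                    break; // first non-empty <b>: stop here
                }
                sb.append(e.text()); // empty <b> adds nothing; other tags add their text
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text";
        System.out.println(textAfter(html, "My Text"));
    }
}
```

Because the empty tags are dropped, the whitespace around them can end up doubled, so normalise it (for example with replaceAll("\\s+", " ")) if that matters to you.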
