Question

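I'm parsing product reviews on Amazon, and I want to get the full text of a review, including the text inside links.

I'm currently using jSoup, and while it's good, it ignores anchors. Of course, I could get all the text from the anchors with a selector, but then I'd lose the context that text appeared in.

I think an example is the best way to explain myself.

Sample structure: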

<div class="container">
  <div style="a">Something...</div>
  <div style="b">...Nested spans and divs... </div>
  <div class="tiny">_____ </div>
  " From the makers of the incredible <a href="SOMELINK">SOMEPRODUCT</a> we have this other product that blablabla.... Amazing specs, but <a href="SOME_OTHER_LINK">this other product</a> is somehow better".
</div>

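What I get: "From the makers of the incredible we have this other product that blablabla.... Amazing specs, but is somehow better".

What I want: "From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better".

My code using jSoup: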

Elements allContainers = doc.select(".container");
for (Element container : allContainers) {
  String reviewText = container.ownText(); // THIS EXCLUDES TEXT FROM LINKS
  StdOut.println(reviewText);
}

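I can't find a way to do this, because it doesn't look like jSoup treats text nodes as actual nodes, so these anchors don't seem to be considered children of any neighbouring node.

I'm also open to other ideas, such as trying to grab them with a :not selector, but I can't believe jSoup doesn't let you keep the text from links. It's too common a need for them to have overlooked it.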

Answer 1

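"it doesn't look like jSoup treats text nodes as actual nodes,"

No. jsoup text nodes are actual nodes, just as elements are.

The way you describe the problem, you have a very specific requirement, and I agree there is no built-in function that does exactly what you want in a single call. With a simple helper method, though, the problem is solvable.

First, let's review the problem. The parent div has the following children:

div div div #text a #text a #text

Of course, each div and a element has children of its own, including text nodes. Based on your example, it sounds like you want to walk the child nodes and skip everything until you reach the first text node. From that text node on, you collect its text and the text of every subsequent node.

That's certainly doable, but I'm not surprised there's no built-in method for it.

Here's one way to solve the problem: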

   // Needs: java.util.List, org.jsoup.nodes.Element,
   //        org.jsoup.nodes.Node, org.jsoup.nodes.TextNode
   public static String textPlus(Element elem)
   {
      // direct text-node children of elem, in document order
      List<TextNode> textNodes = elem.textNodes();
      if (textNodes.isEmpty())
         return "";

      StringBuilder result = new StringBuilder();
      // start at the first text node and walk its following siblings
      Node currentNode = textNodes.get(0);
      while (currentNode != null)
      {
         if (currentNode instanceof TextNode)
         {
            // plain text: append it as-is
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
         }
         else if (currentNode instanceof Element)
         {
            // element (e.g. an anchor): append its deep text
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
         }
         currentNode = currentNode.nextSibling();
      }
      return result.toString();
   }

To call it:

Elements allContainers = doc.select(".container");
for (Element container : allContainers) {
  String reviewText = textPlus(container);
  StdOut.println(reviewText);
}

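Given your sample HTML, this code returns:

"From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better".

Hope this helps.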

Answer 2

I haven't tested this, but according to the jsoup API docs for the Element class, you should use the text method instead of ownText.

text

public String text()

Gets the combined text of this element and all its children.

For example, given HTML <p>Hello <b>there</b> now!</p>, p.text() returns "Hello there now!"

Returns:
    unencoded text, or empty string if none. 
See Also:
    ownText(), textNodes()

ownText

public String ownText()

Gets the text owned by this element only; does not get the combined text of all children.

For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!". Note that the text within the b element is not returned, as it is not a direct child of the p element.

Returns:
    unencoded text, or empty string if none. 
See Also:
    text(), textNodes()
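
For what it's worth, here is a rough sketch of how the options discussed in this thread differ on a structure like the one in the question. It is plain Java with a toy node model: `Node`, `Text`, `Elem`, `deepText` and the `TextModes` class are made-up stand-ins for illustration, not jsoup classes, and `ownText` and `textPlus` here only imitate the idea behind the jsoup method and the helper from Answer 1.

```java
import java.util.Arrays;
import java.util.List;

public class TextModes {
    // Toy stand-ins for a DOM tree; these are NOT jsoup's classes.
    static abstract class Node {}

    static final class Text extends Node {
        final String value;
        Text(String value) { this.value = value; }
    }

    static final class Elem extends Node {
        final String tag;
        final List<Node> children;
        Elem(String tag, Node... children) {
            this.tag = tag;
            this.children = Arrays.asList(children);
        }
    }

    // Text of the direct text-node children only (the idea behind ownText()).
    static String ownText(Elem e) {
        StringBuilder sb = new StringBuilder();
        for (Node n : e.children) {
            if (n instanceof Text) sb.append(((Text) n).value);
        }
        return sb.toString();
    }

    // Text of a node and all its descendants, in document order (the idea behind text()).
    static String deepText(Node n) {
        if (n instanceof Text) return ((Text) n).value;
        StringBuilder sb = new StringBuilder();
        for (Node child : ((Elem) n).children) sb.append(deepText(child));
        return sb.toString();
    }

    // Deep text, but only from the first direct text node onward (the idea behind textPlus).
    static String textPlus(Elem e) {
        StringBuilder sb = new StringBuilder();
        boolean started = false;
        for (Node n : e.children) {
            if (n instanceof Text) started = true;
            if (started) sb.append(deepText(n));
        }
        return sb.toString();
    }

    // A cut-down version of the structure from the question.
    static Elem sample() {
        return new Elem("div",
            new Elem("div", new Text("Something...")),
            new Text("From the makers of "),
            new Elem("a", new Text("SOMEPRODUCT")),
            new Text(" comes "),
            new Elem("a", new Text("this other product")),
            new Text("."));
    }

    public static void main(String[] args) {
        Elem container = sample();
        System.out.println(ownText(container));
        System.out.println(deepText(container));
        System.out.println(textPlus(container));
    }
}
```

On this sample, ownText drops the link text, deepText also pulls in "Something..." from the leading div, and textPlus keeps only the review sentence together with its link text. The sketch does no whitespace normalization, and it assumes there are no whitespace-only text nodes between the leading divs. A real parse may well contain some, in which case you would probably want to skip blank text nodes when picking the starting point.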

Answer 3

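I accepted Guido's answer because, even though it didn't work for me as-is, it definitely put me on the right track.

Guido's code takes the text from the first text node and then iterates through its siblings. Unfortunately, my real-world case had two further complications:

1 - Nothing restricted the extra text to anchors rather than any other element. I wanted something more robust, so I added that selection inside Guido's structure.

2 - This would still pick up text from unwanted links, such as the "Comment" and "Permalink" links at the end of every Amazon review. Additional selectors weed those out.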
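
The filtering idea in point 2 can be sketched in plain Java, independent of jsoup. This is only an illustration under assumptions: the `LinkTextFilter` class, the `UNWANTED` list and both method names are hypothetical, and the match is a case-insensitive substring test, which is how jsoup documents its `:contains` selector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LinkTextFilter {
    // Hypothetical deny list: labels of links that are page chrome, not review content.
    static final List<String> UNWANTED = Arrays.asList("comment", "permalink");

    // True when the link text contains none of the unwanted labels (case-insensitive).
    static boolean isWanted(String linkText) {
        String lower = linkText.toLowerCase(Locale.ROOT);
        for (String label : UNWANTED) {
            if (lower.contains(label)) return false;
        }
        return true;
    }

    // Keeps only the wanted link texts, preserving their order.
    static List<String> keepWanted(List<String> linkTexts) {
        List<String> kept = new ArrayList<String>();
        for (String text : linkTexts) {
            if (isWanted(text)) kept.add(text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "SOMEPRODUCT", "Comment", "this other product", "Permalink");
        System.out.println(keepWanted(links));
    }
}
```

One thing to keep in mind with a substring test like this: it also drops any legitimate link whose text merely contains one of the labels, for example a product with "Comment" in its name.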

I'm posting the code that worked for me, for future reference. Hope it helps :-)

public static String textPlus(Element elem)
{
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty())
        return "";

    StringBuilder result = new StringBuilder();

    // start at the first text node and walk its following siblings
    Node currentNode = textNodes.get(0);

    while (currentNode != null)
    {
        if (currentNode instanceof TextNode)
        {
            // plain text: append it, separated by a blank line
            TextNode currentText = (TextNode) currentNode;
            result.append("\n\n").append(currentText.text());
        }
        else if (currentNode instanceof Element)
        {
            // element: keep only the text of anchors, skipping the
            // "Comment" and "Permalink" links
            Element currentElement = (Element) currentNode;
            Elements anchorElements = currentElement.select("a[href]").select(":not(:contains(Comment))").select(":not(:contains(Permalink))");
            for (Element anchorElement : anchorElements)
                result.append("\n\n").append(anchorElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString().trim();
}

Getting the text of anchors that sit among text nodes

3 answers: