Question

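When fetching post content from a forum, I am having trouble getting the text as the HTML would display it. Using org.jsoup.nodes.Document and getElementsByClass, I can retrieve the following snippet: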

<html>
  <head>

  </head>
  <body>
    <div class="entry-content">
      <div>
        <div>
          <div align="center">
            Some text...<br>continued in 2nd line<br> and third line. This is <b>bold</b>.
          </div>
          <br>


          <div align="center">
            Also, here's a link:
          </div>
          <div align="center">
            <a href="http://www.google.com/" target="_blank" rel="nofollow">http://www.google.com/</a>
          </div>
        </div>
        <div class="clear">

        </div>
      </div>
    </div>
  </body>
</html>

When I paste this into an online HTML editor and copy the rendered output, I get:

Some text...
continued in 2nd line
and third line. This is bold.

Also, here's a link:
http://www.google.com/

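This is exactly what I need. I tried using JEditorPane's renderer, but it strips the br line breaks. It also adds one or two unnecessary blank lines at the bottom.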

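So how can I get the properly rendered text from this HTML snippet, formatted as it would be in a regular text editor, or which jsoup query returns that text?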

Edit: Java code

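import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String htmlPageSource = "...";
Document document = Jsoup.parse(htmlPageSource);
String firstPostHtmlCode = getFirstPostHtmlCode();
System.out.println(firstPostHtmlCode);

// Returns the HTML markup of the first element with class "entry-content".
public String getFirstPostHtmlCode()
{
    Elements userPosts = document.getElementsByClass("entry-content");
    Element firstPost = userPosts.get(0);

    return firstPost.toString();
}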

Answer 1

Could you post the code you are using? You could try String.replace() to replace the 'br' tags with \n.
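A minimal sketch of that replace-based idea, using only the Java standard library. The class name HtmlToText and the method toPlainText are made up for illustration. It assumes a simple fragment like the one in the question: br tags and closing div/p tags become line breaks, all other tags are dropped, and runs of blank lines are collapsed to one. Regex-based tag stripping is a simplification, not a general HTML renderer, and entities such as &amp; are not decoded here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or the JDK.
class HtmlToText {
    // <br>, <br/>, <br /> in any letter case
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Closing tags of block elements that should end a line (assumption: div and p only)
    private static final Pattern BLOCK_END = Pattern.compile("(?i)</(div|p)\\s*>");
    // Any remaining tag
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toPlainText(String html) {
        // Source whitespace (including newlines) collapses to single spaces, as in a browser.
        String s = html.replaceAll("\\s+", " ");
        s = BR.matcher(s).replaceAll("\n");
        s = BLOCK_END.matcher(s).replaceAll("\n");
        s = ANY_TAG.matcher(s).replaceAll("");

        StringBuilder out = new StringBuilder();
        boolean previousBlank = true; // true at the start so leading blank lines are dropped
        for (String line : s.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                previousBlank = true;
                continue;
            }
            if (out.length() > 0) {
                // Keep at most one blank line between text lines.
                out.append(previousBlank ? "\n\n" : "\n");
            }
            out.append(trimmed);
            previousBlank = false;
        }
        // Trailing blank lines are never appended, so nothing to trim here.
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div align=\"center\">Some text...<br>continued in 2nd line<br>"
                + " and third line. This is <b>bold</b>.</div><br>"
                + "<div align=\"center\">Also, here's a link:</div>";
        System.out.println(toPlainText(html));
    }
}
```

With the question's code, this could be called as HtmlToText.toPlainText(firstPost.toString()). Because empty lines at the end are skipped rather than emitted, the extra blank lines at the bottom do not appear in the result.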

Alternatively, you can keep the HTML markup, and Java will recognize the tags. Just wrap the text in HTML tags:

 String x = "<html>" + yourHTML + "</html>";

Without seeing your code, I can't be sure about the blank lines at the bottom.
