I'm trying to do something similar to this: Jsoup: How to get all html between 2 header tags
However, my code seems to skip plain text. The site I'm parsing lays out its markup like this:
div class = "quoted-message"
Response. Can contain images, text, etc.
div class = "quoted-message"
Another response to another quoted message
Code snippet that processes the actual messages:
Element quote = msg.select(".quoted-message").first();
Boolean hasQuote = false;
Elements siblings = null;
siblings = quote.siblingElements();
createQuotePost(quote);
List<Element> elementsBetween = new ArrayList<Element>();
for (int i = 1; i < siblings.size(); i++) {
    Element sibling = siblings.get(i);
    if (! "div.quoted-message".equals(sibling.tagName())) {
        elementsBetween.add(sibling);
    }
    else {
        Log.v("location", "Clear and Process");
        processElementsBetween(elementsBetween);
        elementsBetween.clear();
    }
}
if (! elementsBetween.isEmpty())
    processElementsBetween(elementsBetween);
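However, this doesn't work the way I want. The responses have no special markup (i.e., they don't sit inside p tags). Using some logging, I can see that they are never put into the siblings Elements; siblings seems to contain only the line breaks and the like.
Note: I've only tested this on small posts (simple one-liners) to avoid sifting through long printouts.
Any suggestions on what to do?
Edit: Here is the HTML between two quoted-message divs: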
MESSAGE TO BE QUOTED
</div>
<br />
<br />
Hello quoted message
<br />
I am a response
<br />
<br />
<div class="quoted-message">
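Answer 0 (score: 1)
I think one of the problems is that you're asking for Elements rather than Nodes. Text nodes are Nodes, not Elements, so siblingElements() never returns the plain-text responses; siblingNodes() does.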
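To make the difference concrete, here is a minimal sketch that compares the two calls on a shortened version of the fragment from the question. The class name SiblingKinds, its helper methods, and the sample string are made up for illustration; only the Jsoup calls themselves are real API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical demo class; only the Jsoup calls are real API.
public class SiblingKinds {

    // Illustrative sample, shortened from the fragment in the question.
    static final String SAMPLE =
            "<div class=\"quoted-message\">MESSAGE TO BE QUOTED</div>"
            + "<br />Hello quoted message<br />I am a response<br />";

    static Element firstQuote(String html) {
        return Jsoup.parse(html).select(".quoted-message").first();
    }

    // Siblings that are elements only (here: the <br> tags).
    static int elementSiblingCount(String html) {
        return firstQuote(html).siblingElements().size();
    }

    // All sibling nodes, text nodes included.
    static int nodeSiblingCount(String html) {
        return firstQuote(html).siblingNodes().size();
    }

    // Sibling text nodes that contain more than whitespace.
    static int textSiblingCount(String html) {
        int count = 0;
        for (Node n : firstQuote(html).siblingNodes()) {
            if (n instanceof TextNode && !((TextNode) n).isBlank()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("elements:   " + elementSiblingCount(SAMPLE));
        System.out.println("all nodes:  " + nodeSiblingCount(SAMPLE));
        System.out.println("text nodes: " + textSiblingCount(SAMPLE));
    }
}
```

With this sample, the element siblings are just the three br tags, while the node siblings also include the two response lines as text nodes.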
Try this:
package grimbo.test;

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class StackOverflow {

    public static void main(String[] args) {
        String html = "<div class=quoted-message>message-1</div>\n <br />\n <br />\n Hello quoted message\n <br />\n I am a response\n <br />\n <br />\n";
        html += "<div class=quoted-message>message-2</div>\n <br />\n <br />\n Hello quoted message\n <br />\n I am a response\n <br />\n <br />\n";
        Document doc = Jsoup.parse(html);
        handleQuotedMessages(doc.select(".quoted-message"));
    }

    private static void handleQuotedMessages(Elements quotedMessages) {
        Element firstQuotedMessage = quotedMessages.first();
        if (firstQuotedMessage == null) {
            return; // no quoted messages found
        }
        // siblingNodes() returns text nodes as well as elements
        List<Node> siblings = firstQuotedMessage.siblingNodes();
        List<Node> elementsBetween = new ArrayList<Node>();
        Element currentQuotedMessage = firstQuotedMessage;
        // siblingNodes() leaves out the node itself, so the nodes that follow
        // the first quoted message start at its own sibling index
        for (int i = firstQuotedMessage.siblingIndex(); i < siblings.size(); i++) {
            Node sibling = siblings.get(i);
            // see if this Node is a quoted message
            if (!isQuotedMessage(sibling)) {
                elementsBetween.add(sibling);
            } else {
                // reached the next quote: emit the previous one with its response
                createQuotePost(currentQuotedMessage, elementsBetween);
                currentQuotedMessage = (Element) sibling;
                elementsBetween.clear();
            }
        }
        // emit whatever follows the last quoted message
        if (!elementsBetween.isEmpty()) {
            createQuotePost(currentQuotedMessage, elementsBetween);
        }
    }

    // true only for <div class="quoted-message"> elements
    private static boolean isQuotedMessage(Node node) {
        if (node instanceof Element) {
            Element el = (Element) node;
            return "div".equals(el.tagName()) && el.hasClass("quoted-message");
        }
        return false;
    }

    // keeps only the elements with the given tag name, dropping text nodes
    private static List<Element> filterElements(String tagName, List<Node> nodes) {
        List<Element> els = new ArrayList<Element>();
        for (Node n : nodes) {
            if (n instanceof Element) {
                Element el = (Element) n;
                if (el.tagName().equals(tagName)) {
                    els.add(el);
                }
            }
        }
        return els;
    }

    private static void createQuotePost(Element quote, List<Node> elementsBetween) {
        System.out.println("createQuotePost: " + quote);
        System.out.println("createQuotePost: " + elementsBetween);
        List<Element> imgs = filterElements("img", elementsBetween);
        // handle imgs
    }
}
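createQuotePost above only picks the img elements out of the collected nodes. The response text could be pulled out in the same way, by keeping the TextNode instances instead. The following is a rough sketch under that assumption: the class TextBetween and the methods filterText and responseLines are hypothetical names, not part of the answer's code, and the sample contains a single quoted message so no stop condition is needed.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class; filterText mirrors filterElements but keeps text.
public class TextBetween {

    // Returns the non-blank text of every TextNode in the list, in document order.
    static List<String> filterText(List<Node> nodes) {
        List<String> lines = new ArrayList<String>();
        for (Node n : nodes) {
            if (n instanceof TextNode) {
                TextNode text = (TextNode) n;
                if (!text.isBlank()) {
                    // text() collapses runs of whitespace; trim() drops the edges
                    lines.add(text.text().trim());
                }
            }
        }
        return lines;
    }

    // Parses the HTML and returns the text siblings of the first quoted message.
    static List<String> responseLines(String html) {
        Element quote = Jsoup.parse(html).select(".quoted-message").first();
        if (quote == null) {
            return new ArrayList<String>(); // no quoted message in this HTML
        }
        return filterText(quote.siblingNodes());
    }

    public static void main(String[] args) {
        String html = "<div class=quoted-message>message-1</div>\n <br />\n <br />\n"
                + " Hello quoted message\n <br />\n I am a response\n <br />\n <br />\n";
        System.out.println(responseLines(html));
    }
}
```

Inside createQuotePost, a call such as filterText(elementsBetween) would then sit next to the existing filterElements("img", elementsBetween) call.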