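I've been trying to figure out why jsoup's .select("div.zn-body__paragraph") returns nothing for some CNN articles. For articles like this one it fails even though the markup is clearly present, while for articles like this one it works. Here is the full code I wrote: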
public static String getContentCNN(String link) throws IOException {
    // Collect the text of every matching paragraph div, separated by blank lines.
    StringBuilder finalString = new StringBuilder();
    Elements paragraphs = getDocsCNN(link).select("div.zn-body__paragraph");
    for (Element p : paragraphs) {
        finalString.append(p.text()).append("\n\n");
    }
    return finalString.toString();
}
Both articles contain div elements with this class:
<div class="zn-body__paragraph">Nadler on Wednesday said he didn't know the White House's motives, but he would not allow the White House to try to claim that the President cannot be held accountable.</div>
<div class="zn-body__paragraph">"I don't know whether they're trying to taunt us toward an impeachment or anything else," Nadler said. "All I know is they have made a preposterous claim."</div>
So far I have tried div#class, div[class], and getElementsByClass("class").
Thanks.
Edit: here is the source code for getDocsCNN():
public static Document getDocsCNN(String link) throws IOException {
    return Jsoup.connect(link).userAgent("Mozilla").timeout(6000).get();
}
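One check that might help narrow this down is whether the HTML that getDocsCNN() actually receives contains the class at all, since markup visible in the browser's inspector is not necessarily in the server response. Below is a minimal sketch of such a check using only the standard library. The class name ClassMarkupCheck and the helper countClassOccurrences are hypothetical names made up for illustration, not part of jsoup, and the regex is a rough approximation rather than a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassMarkupCheck {
    // Hypothetical helper (not a jsoup API): counts class attributes in raw HTML
    // whose whitespace-separated value list includes className.
    // Rough regex check only; it would also match attributes such as data-class.
    public static int countClassOccurrences(String html, String className) {
        // Match class="..." or class='...' and capture the attribute value.
        Pattern attr = Pattern.compile(
                "class\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = attr.matcher(html);
        int count = 0;
        while (m.find()) {
            // Split the attribute value into individual class tokens.
            for (String token : m.group(2).trim().split("\\s+")) {
                if (token.equals(className)) {
                    count++;
                    break; // count each attribute at most once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Static sample markup standing in for a fetched page.
        String html = "<div class=\"zn-body__paragraph\">First</div>"
                + "<div class=\"zn-body__paragraph speakable\">Second</div>"
                + "<div class=\"other\">Third</div>";
        System.out.println(countClassOccurrences(html, "zn-body__paragraph"));
    }
}
```

In my code this could be called as countClassOccurrences(getDocsCNN(link).outerHtml(), "zn-body__paragraph"). If it returns 0 for the failing article, the paragraphs are presumably not in the fetched HTML in the first place, which would point away from the selector itself.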