How to get the child elements of a class in jsoup

Time: 2016-09-12 19:25:40

Tags: java jsoup

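I want to scrape comments from a website. I'm having trouble getting the p tag inside an element with a given class in jsoup. The sample HTML code is below:
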
<html>
 <head>
  <title>My webpage</title>
 </head>
 <body>
  <div class="container">
     <div class="comment">
      <p>This is comment</p>
     </div>
  </div>
 </body> 
</html> 

Here is my Java code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Placeholder class name; the original post omitted the class declaration.
public class Main {

    public static void main(String args[]) {
        Document doc = null;
        try {
            doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").get();
            System.out.println("Connect successfully");
            org.jsoup.select.Elements element = doc.select("div.post-message");

            System.out.println(element.get(0).text());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

3 answers:

Answer 0: (score: 2)

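The comments section of the page you are trying to fetch is not plain HTML content. The comments are loaded into the DOM by JavaScript after the initial page load. jsoup is an HTML parser and does not execute JavaScript, so you cannot get the page's comments with jsoup alone. To get this kind of content you need an embedded browser component. Take a look at this answer: Is there a way to embed a browser in Java?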
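The point about script-loaded content can be checked without touching the network. Below is a minimal sketch, assuming a made-up page whose comment markup is only created by a script. The HTML string, the class name ScriptNotExecuted, and the helper countMatches are hypothetical and only for illustration; div.post-message is the selector from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ScriptNotExecuted {

    // Hypothetical page: the comment <div> would only exist after the script runs in a browser.
    static final String HTML =
            "<html><body>"
            + "<div id=\"thread\"></div>"
            + "<script>"
            + "document.getElementById('thread').innerHTML = "
            + "'<div class=\"post-message\"><p>This is comment</p></div>';"
            + "</script>"
            + "</body></html>";

    // Counts the elements that match a CSS selector in the statically parsed document.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // The script body is kept as raw text and never run, so the comment div is not found.
        System.out.println(countMatches(HTML, "div.post-message"));
        // The empty placeholder that is present in the static markup is found.
        System.out.println(countMatches(HTML, "div#thread"));
    }
}
```

If the selector matches nothing in a page fetched with Jsoup.connect(...).get() but the element is visible in the browser's inspector, script-generated content is a likely cause.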

The following code works for the specific HTML string you provided.

Try this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] arg) {
        Document doc = null;
        try {
            doc = Jsoup.parse("<html> "
                    + "<head>  "
                    + "<title>My webpage</title> "
                    + "</head> <body>  <div class=\"container\">     "
                    + "<div class=\"comment\">      "
                    + "<p>This is comment</p>    "
                    + " </div>  </div> </body></html> ");

            // Narrow to .container, then to .comment, then read the text of its <p>.
            Elements element = doc.select(".container").select(".comment");
            System.out.println(element.get(0).select("p").text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To connect to the URL:

doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").timeout(60*1000).userAgent("Mozilla").get();

Answer 1: (score: 1)

To expand on Arijit's solution, if there are multiple <div> tags with the comment class, you can try:

Document doc = null;
try {
    doc = Jsoup.parse("<html> "
            + "<head> "
            + "<title>My webpage</title> "
            + "</head> <body> <div class=\"container\"> "
            + "<div class=\"comment foo\"> "
            + "<p>This is comment</p> "
            + " </div> </div> </body></html> ");

    Elements comments = doc.getElementsByAttributeValueMatching("class", "comment");
    Iterator<Element> iter = comments.iterator();
    while (iter.hasNext()) {
        Element e = iter.next();
        System.out.println(e.getElementsByTag("p").text());
    }
} catch (Exception e) {
    e.printStackTrace();
}

If other tags share the comment class, you can use e.tagName() to check that it is a <div>.
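As a possible alternative to the attribute match plus the tagName() check, jsoup's CSS selector syntax can express both conditions in one query. The sketch below assumes the sample HTML is extended with a second comment div and a section that shares the class. The class name CommentParagraphs, the helper commentTexts, and the extra markup are hypothetical. One difference worth noting: getElementsByAttributeValueMatching treats its second argument as a regular expression, so it may also match longer class values that merely contain "comment", while a class selector matches whole class names.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CommentParagraphs {

    // Returns the text of every <p> that is a direct child of a <div> with class "comment".
    static List<String> commentTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.comment > p").eachText();
    }

    public static void main(String[] args) {
        // Hypothetical markup: two comment divs and one non-div element sharing the class.
        String html = "<div class=\"container\">"
                + "<div class=\"comment\"><p>First comment</p></div>"
                + "<div class=\"comment foo\"><p>Second comment</p></div>"
                + "<section class=\"comment\"><p>Not in a div</p></section>"
                + "</div>";
        System.out.println(commentTexts(html));
    }
}
```

Under these assumptions the section's paragraph is skipped because the selector requires a div parent, so no separate tag-name check is needed.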

Answer 2: (score: 0)

If your goal is to print This is comment, you can try something like this:

org.jsoup.select.Elements element = doc.select("div.container").select("div.comment");
System.out.println(element.get(0).text());
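
Note that get(0) throws an IndexOutOfBoundsException when the selector matches nothing, for example when the comment markup is absent from the fetched HTML. A minimal sketch of a guarded variant follows, assuming jsoup 1.11.1 or later for selectFirst. The class name FirstComment and the helper firstCommentText are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstComment {

    // Returns the text of the first div.comment inside div.container, or null if none exists.
    static String firstCommentText(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("div.container div.comment");
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        String html = "<div class=\"container\">"
                + "<div class=\"comment\"><p>This is comment</p></div>"
                + "</div>";
        System.out.println(firstCommentText(html));
        // No matching element: the helper returns null instead of throwing.
        System.out.println(firstCommentText("<p>no comments here</p>"));
    }
}
```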