Extracting p, ul, h3 and img with Jsoup

Time: 2013-12-19 08:37:03

Tags: java jsoup

   String url = "http://www.hardwarezone.com.sg/review-sony-playstation-4-does-greatness-await";
   Document doc = Jsoup.connect(url).get();
   Elements content = doc.select("#content p, #content table ul, #content h3");
   Elements img = doc.select("#content [src]"); 

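Basically, what I want to do is extract the p, ul, h3 and img elements from the following URL: http://www.hardwarezone.com.sg/review-sony-playstation-4-does-greatness-await

The problem I am facing now is getting all of the content to display one item after another, in the same order as the site's layout.

I have tried using a for loop to generate the absolute img links, but doing so throws the layout off, since the images end up in a separate list from the text.

Below is the code I used: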

for (Element bb : img) {
    String src = bb.attr("abs:src");  // the "abs:" prefix resolves src to an absolute URL
    System.out.println(src);
}

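1 answer:

Answer 0: (score: 0)

After extracting all the elements you want, loop over them and replace the src of every img element with the image's absolute URL. You can retrieve it with the absUrl() method of jsoup's Node class: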

String url = "http://www.hardwarezone.com.sg/review-sony-playstation-4-does-greatness-await";
Document doc = Jsoup.connect(url).get();
Elements content = doc.select("#content p, #content table ul, #content h3, #content [src]");

for (Element e : content) {
    if (e.nodeName().equals("img")) {    // if node is <img>
        e.attr("src", e.absUrl("src"));  // set src attribute to be absolute url 
    }
}

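If you print out content (for example, System.out.println(content)), you will see that all the elements are still in the same order in which they appear on the original page, and that absolute URLs have been inserted for all images.

For example (note that this is only a small section of the output):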

<p class="rtecenter"><img src="http://www.hardwarezone.com.sg/files/img/2013/12/rearports.jpg" width="700" height="232" title="The entire rear side is covered in cooling vents. This is also the first Playstation to ditch all analog connectors." alt="" /></p>
<img src="http://www.hardwarezone.com.sg/files/img/2013/12/rearports.jpg" width="700" height="232" title="The entire rear side is covered in cooling vents. This is also the first Playstation to ditch all analog connectors." alt="" />
<p>Rather frustratingly, especially for a next-gen console that is expected to last at least the next five years, the PS4 doesn't support the Wireless 802.11ac standard, instead utilizing the older 802.11b/g/n network, and even then 5GHz bands are not supported! So you're stuck with&nbsp;2.4 GHz speeds. This makes a wired connection almost mandatory, as downloading games or even large update files over wireless can be extremely sluggish.</p>
<h3 class="page_title">&nbsp;</h3>  
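As background on what "absolute URL" means here, the following is a minimal sketch using only java.net.URI from the standard library. It is not jsoup's implementation, only an illustration of resolving a reference against the page URL. The class name AbsoluteUrlSketch, the helper toAbsolute, and the assumption that the page used a root-relative src such as /files/img/2013/12/rearports.jpg are all hypothetical; the raw markup of the page is not shown in this thread.

```java
import java.net.URI;

class AbsoluteUrlSketch {

    // Resolve a possibly relative reference against the URL of the page it came from.
    // Hypothetical helper for illustration; jsoup's absUrl() does its own resolution.
    static String toAbsolute(String pageUrl, String ref) {
        return URI.create(pageUrl).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.hardwarezone.com.sg/review-sony-playstation-4-does-greatness-await";

        // Root-relative reference (assumed form of the original src attribute)
        System.out.println(toAbsolute(page, "/files/img/2013/12/rearports.jpg"));

        // A reference that is already absolute is returned unchanged
        System.out.println(toAbsolute(page, "http://cdn.example.com/a.png"));
    }
}
```

Under these assumptions the first call yields the same rearports.jpg URL that appears in the sample output above.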

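Update:

Because the selector matches both a <p> and the <img> inside it, each such image appears twice, as in the sample output. Add this loop to remove the <img> elements from the <p> elements while keeping the <a href> elements: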

for (Element e : content) {
    if (e.nodeName().equals("p")) {
        for (Element child : e.children()) {
            if (child.nodeName().equals("img")) child.remove();
        }
    }
}
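
To try the two loops without hitting the network, they can be run against a small inline document. This is a minimal sketch, assuming jsoup is on the classpath. The class name OrderedExtractDemo, the SAMPLE_HTML markup and the example.com base URI are made up for illustration and are not the real page; the selector is also simplified to p, h3 and img.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class OrderedExtractDemo {

    // Hypothetical stand-in for the article markup; not the real page.
    static final String SAMPLE_HTML =
          "<div id=\"content\">"
        + "<h3>Ports</h3>"
        + "<p class=\"rtecenter\"><img src=\"/files/img/rearports.jpg\"></p>"
        + "<p>Some text with a <a href=\"/more\">link</a>.</p>"
        + "</div>";

    static Elements extract(String html, String baseUri) {
        // The base URI is what relative src values are resolved against.
        Document doc = Jsoup.parse(html, baseUri);

        // One combined selector, so matches come back in document order.
        Elements content = doc.select("#content p, #content h3, #content img");

        // First pass: rewrite every image src to its absolute form.
        for (Element e : content) {
            if (e.nodeName().equals("img")) {
                e.attr("src", e.absUrl("src"));
            }
        }

        // Second pass: drop <img> children of selected <p> elements,
        // so each image is listed once (as its own entry) instead of twice.
        for (Element e : content) {
            if (e.nodeName().equals("p")) {
                for (Element child : e.children()) {
                    if (child.nodeName().equals("img")) child.remove();
                }
            }
        }
        return content;
    }

    public static void main(String[] args) {
        for (Element e : extract(SAMPLE_HTML, "http://example.com/article")) {
            System.out.println(e.nodeName() + " -> " + e.outerHtml());
        }
    }
}
```

Note that the second loop only removes images that are direct children of a <p>; an image nested deeper (for example inside a link) would be left in place.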