Jsoup: extract a group of selectors until a specified selector is reached

Date: 2015-08-09 08:56:43

Tags: java jsoup

I have the following page:

<div>
 <h3>...</h3>
 <span>...</span>
 <p>...</p>
 <span>...</span>
 <span>...</span>
 <span>...</span>
 <p>...</p>
 <span>...</span>
 <span>...</span>
 <hr />
 <h3>...</h3>
 <span>...</span>
 <p>...</p>
 <p>...</p>
 <hr />
 <h3>...</h3>
 <span>...</span>
 <span>...</span>
 <p>...</p>
 <p>...</p>
 <hr />
</div>

As you can see, most of the selectors are at the same level. I am trying to figure out how to scrape one block at a time with Jsoup. A block means all the selectors starting with <h3> and ending with <hr /> (there are 3 blocks in the example above). The selectors in between are mixed, and their number can vary.

I read the official API documentation but could not find a suitable method.

1 answer:

Answer 0 (score: 1)

package stack;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class Stack {

    public static void main(String args[]) throws Exception {
        File input = new File("test.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        List<Elements> blocks = new ArrayList<>();

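        // Each <h3> starts a new block.
        Elements listofh3 = doc.getElementsByTag("h3");
        for(Element h3 : listofh3) {
            Elements block = new Elements();
            // Collect the <h3> and its following siblings, up to and including the closing <hr>.
            // The null check avoids a NullPointerException when the last block has no <hr>.
            Element cursor = h3;
            while(cursor != null) {
                block.add(cursor);
                if(cursor.tagName().equals("hr")) {
                    break;
                }
                cursor = cursor.nextElementSibling();
            }
            blocks.add(block);
        }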

        for(Elements block : blocks) {
            System.out.println(block);
            System.out.println("----------------------------");
        }
    }
}

Another solution could be this:

package stack;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class Stack {

    public static void main(String args[]) throws Exception {
        File input = new File("test.html");
        Document doc = Jsoup.parse(input, "UTF-8");

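        Elements listofh3 = doc.getElementsByTag("h3");
        for(Element h3 : listofh3) {
            // Wrapper element that will hold one whole block.
            Element span = doc.createElement("span");
            span.addClass("block");

            // Move the <h3> and its following siblings into the wrapper until the <hr> is reached.
            Element cursor = h3;
            while(cursor != null && !cursor.tagName().equals("hr")) {
                // Look up the next sibling before moving the current element.
                Element next = cursor.nextElementSibling();
                span.appendChild(cursor);
                cursor = next;
            }
            if(cursor != null) {
                cursor.remove(); // remove the <hr> separator
            }
            doc.body().appendChild(span);
        }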

        System.out.println(doc);
    }
}

Test input

<div>
 <h3>header 1</h3> 
 <span>span 1</span>
 <p>p 1</p>
 <span>span 11</span>
 <span>span 111</span>
 <span>span 1111</span>
 <p>p 11</p>
 <span>span 11111</span>
 <span>span 111111</span>
 <hr />

 <h3>header 2</h3> 
 <span>span 2</span>
 <p>p 2</p>
 <p>p 22</p>
 <hr />

 <h3>header 3</h3> 
 <span>span 3</span>
 <span>span 33</span>
 <p>p 3</p>
 <p>p 33</p>
 <hr />
</div>
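
Both solutions above walk the siblings of every <h3> separately. As a rough sketch, the same grouping can also be written as a single pass over the children of the parent element. The class `BlockGrouper` and the method `groupUntil` below are hypothetical names, not part of Jsoup, and the example works on plain tag names so that it runs without any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class BlockGrouper {

    /**
     * Splits items into blocks in one pass. A block opens at an item matching
     * isStart and closes at the next item matching isEnd, which is included.
     * Items outside any block are skipped; an unterminated last block is kept.
     */
    public static <T> List<List<T>> groupUntil(List<T> items,
                                               Predicate<T> isStart,
                                               Predicate<T> isEnd) {
        List<List<T>> blocks = new ArrayList<>();
        List<T> current = null; // null means "not inside a block"
        for (T item : items) {
            if (current == null) {
                if (!isStart.test(item)) {
                    continue; // skip items before a start marker
                }
                current = new ArrayList<>();
            }
            current.add(item);
            if (isEnd.test(item)) {
                blocks.add(current);
                current = null;
            }
        }
        if (current != null) {
            blocks.add(current); // last block had no closing marker
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Tag names of the <div>'s children in the test input, in document order.
        List<String> tags = Arrays.asList(
                "h3", "span", "p", "span", "span", "span", "p", "span", "span", "hr",
                "h3", "span", "p", "p", "hr",
                "h3", "span", "span", "p", "p", "hr");

        List<List<String>> blocks = groupUntil(tags, "h3"::equals, "hr"::equals);
        for (List<String> block : blocks) {
            System.out.println(block);
        }
    }
}
```

With Jsoup, the same method could presumably be called on the parent's children, for example `groupUntil(div.children(), e -> e.tagName().equals("h3"), e -> e.tagName().equals("hr"))`, where `div` is a hypothetical variable holding the <div> element. This relies on `Elements` being a `List<Element>`.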