Jsoup: remove everything before an H2 tag

Date: 2017-05-12 10:41:43

Tags: java html jsoup extract

I use the Jsoup.connect() method to fetch the HTML source of a web page. Below is an excerpt from that HTML source (link: https://docs.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community):

.....
<p>When you set dependencies in your VSIX manifest, you must specify Component IDs 
   only. Use the tables on this page to determine our minimum component dependencies. 
   In some scenarios, this might mean that you specify only one component from a workload. 
   In other scenarios, it might mean that you specify multiple components from a single 
   workload or multiple components from multiple workloads. For more information, see 
   the 
<a href="../extensibility/how-to-migrate-extensibility-projects-to-visual-studio-2017" data-linktype="relative-path">How to: Migrate Extensibility Projects to Visual Studio 2017</a> page.</p>
.....
<h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>
.....
<h2 id="see-also">See also</h2>
.....

What I want to do with jsoup is remove every piece of HTML before <h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>, and also remove everything from <h2 id="see-also">See also</h2> onward, including that heading itself.

I have the following attempt, but it does almost nothing useful for me:

        try {
            document = Jsoup.connect(Constants.URL).get();
        }
        catch (IOException iex) {
            iex.printStackTrace();
        }
        document = Parser.parse(document.toString().replaceAll(".*?<a href=\"workload-and-component-ids\" data-linktype=\"relative-path\">Visual Studio 2017 Workload and Component IDs</a> page.</p>", "") , Constants.URL);
        document = Parser.parse(document.toString().replaceAll("<h2 id=\"see-also\">See also</h2>?.*", "") , Constants.URL);
        return null;

Any help would be greatly appreciated.

2 answers:

Answer 0 (score: 1)

A simple approach might be: take the entire HTML of the page as a string, cut out the substring you need, and parse that substring again with jsoup.

        Document doc = Jsoup.connect("https://docs.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community").get();
        String full = doc.html();  // serialize once and reuse
        // The -8 steps back over `<h2 id="`, assuming the first match of each id sits inside such a tag.
        String html = full.substring(full.indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017") - 8,
                                     full.indexOf("unaffiliated-components") - 8);
        Document doc2 = Jsoup.parse(html);
        System.out.println(doc2);
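
The hard-coded `-8` offset only works when each id is preceded by exactly `<h2 id="`. As a rough sketch of the same substring idea without magic numbers, the helper below backs up to the nearest `<` before each id attribute. `SectionSlicer`, `tagStart`, and `between` are hypothetical names made up for this illustration and are not part of jsoup; the sample HTML in `main` is likewise invented.

```java
// Hypothetical helper (not a jsoup API): slices raw HTML between two elements
// identified by their id attributes.
final class SectionSlicer {

    /** Index of the '<' opening the tag that carries id="...", or -1 if absent. */
    static int tagStart(String html, String id) {
        int idPos = html.indexOf("id=\"" + id + "\"");
        return idPos < 0 ? -1 : html.lastIndexOf('<', idPos);
    }

    /** Keeps everything from the start tag (inclusive) up to the end tag (exclusive). */
    static String between(String html, String startId, String endId) {
        int start = tagStart(html, startId);
        int end = tagStart(html, endId);
        if (start < 0 || end < 0 || end < start) {
            throw new IllegalArgumentException("markers not found or out of order");
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Invented sample input standing in for the downloaded page.
        String html = "<p>intro</p>\n"
                + "<h2 id=\"core\">Core</h2>\n<table></table>\n"
                + "<h2 id=\"see-also\">See also</h2>\n<ul></ul>";
        System.out.println(between(html, "core", "see-also"));
    }
}
```

The returned string could then be handed to `Jsoup.parse(...)` exactly as in the answer above.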

Answer 1 (score: 1)

I only made a small change to @eritrean's answer above. With that minor modification I was able to get the desired output.

        document = Jsoup.parse(document.html().substring(document.html().indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017")-26,
                document.html().indexOf("see-also")-8));
        System.out.println(document);
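
As a side note on the regex attempt in the question: one likely reason it removes nothing is that in Java regular expressions `.` does not match line terminators by default, so `.*?` cannot span the multi-line HTML. The sketch below shows the same trimming with the `(?s)` (DOTALL) flag; `RegexTrim` and `trim` are hypothetical names used only for illustration, and the tag strings are passed in literally via `Pattern.quote`.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating DOTALL-based trimming; not taken from the thread.
final class RegexTrim {

    /** Drops everything before startTag and everything from endTag onward. */
    static String trim(String html, String startTag, String endTag) {
        // (?s) enables DOTALL, so '.' also matches newlines.
        // \A.*? lazily consumes from the beginning up to the first startTag (kept via lookahead).
        String head = "(?s)\\A.*?(?=" + Pattern.quote(startTag) + ")";
        // endTag followed by everything up to the end of input.
        String tail = "(?s)" + Pattern.quote(endTag) + ".*\\z";
        return html.replaceFirst(head, "").replaceFirst(tail, "");
    }
}
```

If either tag is missing, the corresponding pattern simply does not match and that side of the input is left untouched.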