Question

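Before asking this question I looked through several forums. Basically, I need to select part of the text in an HTML file. The HTML is structured like this: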

<div class = "pane big">
<code>
<pre>
SomeText
<a id="par1" href="#par1">¶</a>
MoreText
.
.
.
<a id="par2" href="#par2">¶</a>
MoreText
</pre>
</code>
</div>

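So what I need to do is extract the text that follows the par1 anchor, and then separately get the text that follows the par2 anchor. I tried Jsoup, but all I managed was to select the entire text of the div. I also tried XPath, but the expressions I evaluated were not accepted, possibly because the file is not XML to begin with.

An example of the XPath expression I used: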

/html/body/div/div[2]/code[2]/pre/text()[3]

and of the CSS selector:

body > div > div.pane.big > code:nth-child(7) > pre

Answer 1

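Wait, so you need the part inside the href attribute, right? Say we have <a id="par1" href="#iNeedThisPart">¶</a>; do you want 'iNeedThisPart'? If that is really what you want, use the CSS query a[href], which selects every 'a' tag that has an 'href' attribute. The Jsoup code for that is as follows: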

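import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Place this method inside a class of your own.
public List<String> getTextWithinHrefAttribute(final File file) throws IOException {
    final List<String> hrefTexts = new ArrayList<>();
    final Document document = Jsoup.parse(file, "utf-8");
    // Select every <a> element that carries an href attribute.
    final Elements ahrefs = document.select("a[href]");

    for (final Element ahref : ahrefs) {
        // attr("href") returns the raw attribute value, e.g. "#par1".
        hrefTexts.add(ahref.attr("href"));
    }
    return hrefTexts;
}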

I am assuming you are parsing from a file rather than scraping a web page.

Answer 2

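This cannot be done with pure CSS selectors; you need some additional extraction and joining logic in your Java code:

1. Select the pre element.
2. Split its content into a sequence of text parts, using the a elements as separators.
3. Skip the first part and join the next two (or more) parts.

Here is a simple code example (JDK 1.8 style with the Stream API, and the older JDK 1.5 - 1.7 style):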

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

import static java.util.Arrays.stream;
import static java.util.stream.Collectors.joining;

public class SimpleParser {
    public static void main(String[] args) throws IOException {
        final Document document = Jsoup.parse(new File("div.html"), "UTF-8");
        final Elements elements = document.select("div.pane.big pre");

        System.out.println("JDK 1.8 style");
        System.out.println(
                stream(elements.html().split("\\s+<a.+</a>\\s+"))
                        .skip(1)
                        .collect(joining("\n")
                        ));

        System.out.println("\nJDK 1.7 style");
        String[] textParts = elements.html().split("\\s+<a.+</a>\\s+");
        StringBuilder resultText = new StringBuilder();
        for (int i = 1; i < textParts.length; i++) {
            resultText.append(textParts[i] + "\n");
        }
        System.out.println(resultText.toString());
    }
}
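
If the parts should stay separate (one text per anchor, as the question asks) rather than being joined, the same splitting idea can be sketched with only the standard library. This is a rough sketch, not part of the code above: the class name AnchorSections and the method textByAnchorId are hypothetical, and it assumes the input is the inner HTML of the pre element (for example what elements.html() returns) with double-quoted id attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorSections {
    // Matches one <a ...>...</a> tag and captures the value of its id attribute.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s[^>]*\\bid=\"([^\"]+)\"[^>]*>.*?</a>", Pattern.DOTALL);

    /** Maps each anchor id to the text between that anchor and the next one (or the end). */
    public static Map<String, String> textByAnchorId(String preHtml) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = ANCHOR.matcher(preHtml);
        String currentId = null;
        int textStart = 0;
        while (m.find()) {
            if (currentId != null) {
                // Text that belongs to the previous anchor ends where this anchor starts.
                sections.put(currentId, preHtml.substring(textStart, m.start()).trim());
            }
            currentId = m.group(1);
            textStart = m.end();
        }
        if (currentId != null) {
            // The last anchor owns everything up to the end of the input.
            sections.put(currentId, preHtml.substring(textStart).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String preHtml = "SomeText\n"
                + "<a id=\"par1\" href=\"#par1\">\u00b6</a>\n"
                + "MoreText 1\n"
                + "<a id=\"par2\" href=\"#par2\">\u00b6</a>\n"
                + "MoreText 2\n";
        // prints {par1=MoreText 1, par2=MoreText 2}
        System.out.println(textByAnchorId(preHtml));
    }
}
```

Text that appears before the first anchor (SomeText in the sample) is skipped, which matches the "skip the first part" step. Like the split-based code above, this treats the markup as a string, so it only suits simple, predictable input such as this pre block.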

P.S. Note that the last div tag in your HTML sample should be a closing tag.

Select part of the text in HTML using Java
