Question

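I want to apply KrovetzStemmer to the pages I download. My biggest problem is that I can't simply call body().text() on a given document and then stem all the words, because I also need the href links, and those must not be stemmed at all. So I thought that if I could get the body text together with the href links, I could split it at the links and then use a LinkedHashMap with an Element as the key and a boolean or enum value to indicate whether the Element is text or a link.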

For example, given this HTML:

<!DOCTYPE html>
<html>
<body>
<h1> This is the heading part. This is for testing purposes only.</h1>
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a>
<p>This is the first paragraph to be considered.</p>
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a>
<p>This is the second paragraph to be considered.</p>
<img border="0" src="/images/pulpit.jpg" alt="Pulpit rock" width="304" height="228">
<a href="http://www.thirdsite.com">Third Link</a>
</body>
</html>

I would like to be able to get this:

This is the heading part. This is for testing purposes only.
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a>
This is the first paragraph to be considered.
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a>
This is the second paragraph to be considered.
<a href="http://www.thirdsite.com">Third Link</a>

and then split it and insert the pieces into a LinkedHashMap, so that if I do this:

int i = 1;
// A false value marks a text entry, true marks a link; only text entries are printed.
for (Entry<Element, Boolean> entry : splitedList.entrySet()) {
    if (!entry.getValue()) {
        System.out.println(i + ": " + entry.getKey());
    }
    i++;
}

it prints:

1: This is the heading part. This is for testing purposes only.
3: This is the first paragraph to be considered.
5: This is the second paragraph to be considered.

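That way I can apply stemming and still keep the iteration order.

Right now I don't know how to implement this, because I don't know how to:

a) get the body text with only the href links kept as markup

b) split the body (I know that with Strings we can always use split(), but I am talking about the elements of the page body)

How can I do these two things?

Also, I am not sure whether my approach is a good one. Is there a better or simpler way to do this?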

Answer 1

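Now that I understand your requirement, I am updating this post with a new answer.

Assume you have obtained the HTML document doc by parsing the given HTML.

You can select all the a tags and wrap each one in an <xmp> tag, whose content is treated as raw text rather than markup: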

for (Element element : doc.body().select("a"))
     element.wrap("<xmp></xmp>");

Next, reload the new HTML into doc so that Jsoup does not parse the content inside the <xmp> tags:

 doc = Jsoup.parse(doc.html());
 System.out.println(doc.body().text());

The output is:

This is the heading part. This is for testing purposes only.
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a>
This is the first paragraph to be considered.
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a>
This is the second paragraph to be considered.
<a href="http://www.thirdsite.com">Third Link</a>

Now you can go on and do whatever you want with the output.
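One way to continue from this output, sketched with only the standard library, is to scan the text for the literal <a ...>...</a> chunks and record every piece in order, together with a flag saying whether it is a link. SegmentSplitter, Segment, and the regular expression are illustrative names and assumptions, not part of Jsoup. The pattern only targets anchors as they appear in the text() output above; it is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits the text() output into ordered text/link segments. */
final class SegmentSplitter {

    /** One piece of the body; isLink is true for a literal <a ...>...</a> chunk. */
    static final class Segment {
        final String content;
        final boolean isLink;

        Segment(String content, boolean isLink) {
            this.content = content;
            this.isLink = isLink;
        }
    }

    // Assumption: anchors appear as plain "<a ...>...</a>" substrings in the text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>.*?</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<Segment> split(String bodyText) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = ANCHOR.matcher(bodyText);
        int last = 0;
        while (m.find()) {
            addText(segments, bodyText.substring(last, m.start()));
            segments.add(new Segment(m.group(), true));
            last = m.end();
        }
        addText(segments, bodyText.substring(last));
        return segments;
    }

    // Skips whitespace-only gaps so they do not consume a position number.
    private static void addText(List<Segment> segments, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            segments.add(new Segment(trimmed, false));
        }
    }

    public static void main(String[] args) {
        String bodyText = "This is the heading part. This is for testing purposes only. "
                + "<a href=\"http://www.firstsite.com/this is a sub directory/\">First Link</a> "
                + "This is the first paragraph to be considered. "
                + "<a href=\"http://www.thirdsite.com\">Third Link</a>";
        int i = 1;
        for (Segment s : split(bodyText)) {
            if (!s.isLink) {
                System.out.println(i + ": " + s.content);
            }
            i++;
        }
    }
}
```

With the sample string in main, only the text segments at positions 1 and 3 are printed. A List of small value objects keeps the order just as a LinkedHashMap would, and it avoids relying on how Element behaves as a map key.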

Update: based on your comment about splitting, here is the revised code:

for (Element element : doc.body().select("a"))
    element.wrap("<xmp>split-me-here</xmp>split-me-here");  

doc = Jsoup.parse(doc.html());

int cnt = 0;
List<String> splitText = Arrays.asList(doc.body().text().split("split-me-here"));
for (String text : splitText) {
    cnt++;
    if (!text.contains("</a>"))
        System.out.println(cnt + "." + text.trim());
}

The code above prints the following output:

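1.This is the heading part. This is for testing purposes only.
3.This is the first paragraph to be considered.
5.This is the second paragraph to be considered.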
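To stem only the text chunks, the same marker-based split can feed a word-level stemming function while the link chunks pass through unchanged. The following is a minimal sketch under stated assumptions: SelectiveStemmer is a hypothetical helper, and the stemmer is passed in as a UnaryOperator<String> because the KrovetzStemmer class and its method signature are not shown in this thread. The main method uses lowercasing purely as a stand-in for a real stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical helper: stems text chunks and leaves link chunks untouched. */
final class SelectiveStemmer {

    // Same marker string that the wrap() call above inserts around each link.
    static final String MARKER = "split-me-here";

    static List<String> stemTextChunks(String markedText, UnaryOperator<String> stemWord) {
        List<String> result = new ArrayList<>();
        for (String chunk : markedText.split(Pattern.quote(MARKER))) {
            String trimmed = chunk.trim();
            if (trimmed.contains("</a>")) {
                result.add(trimmed); // link chunk: keep verbatim
            } else {
                result.add(stemWords(trimmed, stemWord));
            }
        }
        return result;
    }

    // Applies the supplied stemming function to each whitespace-separated word.
    private static String stemWords(String text, UnaryOperator<String> stemWord) {
        if (text.isEmpty()) {
            return text;
        }
        StringBuilder sb = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(stemWord.apply(word));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder only: lowercasing stands in for a real Krovetz stemmer call.
        UnaryOperator<String> placeholder = String::toLowerCase;
        String marked = "This is the Heading split-me-here"
                + "<a href=\"http://www.thirdsite.com\">Third Link</a>split-me-here"
                + " Second Paragraph";
        for (String chunk : stemTextChunks(marked, placeholder)) {
            System.out.println(chunk);
        }
    }
}
```

Words are passed to the stemming function exactly as they appear, including any trailing punctuation such as "part.", so a real stemmer may need the punctuation stripped first.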
