Question

I am crawling a website with crawler4j. I use jsoup to extract the content and save it to plain-text files, and then I use OmegaT to count the words in those text files.

The problem I am facing is with the text extraction. I am using the following function to extract text from the HTML.

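import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // renamed Safelist in recent jsoup releases

public static String cleanTagPerservingLineBreaks(String html) {
    String result = "";
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);

    // Disable pretty-printing so html() keeps the original line breaks and spacing.
    document.outputSettings(new Document.OutputSettings()
            .prettyPrint(false));
    // Mark <br> and <p> with a literal backslash-n placeholder.
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newlines.
    result = document.html().replaceAll("\\\\n", "\n");
    result = result.replaceAll("&nbsp;", " ");
    result = result.trim();
    // Strip all remaining tags.
    result = Jsoup.clean(result, "", Whitelist.none(),
            new Document.OutputSettings().prettyPrint(false));
    return result;
}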

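On the line result = document.html().replaceAll("\\\\n", "\n"); if I use document.text() instead of document.html(), I get well-formatted text with proper spacing, but when I run a word count in OmegaT the unique words are not reported correctly. If I keep using document.html(), I get a correct word count, but there is no spacing between some words (for example, menu text such as women's new arrivals, shirts, pants, jeans, skirts, men's, all men's new arrivals runs together), and tags such as strong and em are not removed by Jsoup.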
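
The fluctuation described above is consistent with how whitespace tokenization behaves: when tags are removed without leaving a separator, neighbouring words fuse into one token, which changes both the total and the unique word counts. A minimal Java sketch of that effect, using only the standard library. The class and method names are hypothetical, the sample strings are made up for illustration, and OmegaT's own counting rules may well differ from this simple split on whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of jsoup or OmegaT.
class WordCountDemo {

    // Splits on runs of whitespace (an assumption about how a counter tokenizes).
    static List<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Total number of whitespace-separated tokens.
    static int totalWords(String text) {
        return tokens(text).size();
    }

    // Number of distinct tokens, compared case-insensitively.
    static int uniqueWords(String text) {
        Set<String> unique = new HashSet<>();
        for (String token : tokens(text)) {
            unique.add(token.toLowerCase(Locale.ROOT));
        }
        return unique.size();
    }

    public static void main(String[] args) {
        String spaced = "Shirts Pants Jeans Shirts";   // separators preserved
        String fused = "ShirtsPantsJeansShirts";       // separators lost with the tags
        System.out.println(totalWords(spaced) + " total, " + uniqueWords(spaced) + " unique"); // 4 total, 3 unique
        System.out.println(totalWords(fused) + " total, " + uniqueWords(fused) + " unique");   // 1 total, 1 unique
    }
}
```

The same four words yield very different counts depending on whether a separator survives tag removal, which may explain why the two extraction paths disagree.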

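Is there a way to put a space between all tags and strip the content properly? Also, could someone explain why the word count might fluctuate like this?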
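
One possible way to approach the spacing part, sketched under assumptions: replace every tag with a single space and then collapse runs of whitespace, so that text from adjacent elements cannot fuse. The sketch below uses a plain regular expression rather than jsoup, purely to stay self-contained. It is a crude stand-in that ignores comments, scripts, and `>` inside attribute values, and the class and method names are hypothetical. A parser-based traversal would be more robust.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a regex stand-in for illustration, not a jsoup API.
class TagSpacer {

    // Matches anything that looks like a tag: "<", one or more non-">" characters, ">".
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // \s does not match the no-break space (U+00A0), so it is listed explicitly.
    private static final Pattern SPACES = Pattern.compile("[\\s\\u00A0]+");

    static String stripTagsWithSpaces(String html) {
        if (html == null) {
            return null;
        }
        // Every tag becomes one space, so "</li><li>" can no longer glue words together.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Decode the one entity the question mentions; other entities are left as-is.
        String decoded = noTags.replace("&nbsp;", " ");
        // Collapse whitespace runs (including line breaks) into single spaces.
        return SPACES.matcher(decoded).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<ul><li>Shirts</li><li><strong>Pants</strong></li><li>Jeans</li></ul>";
        System.out.println(stripTagsWithSpaces(html)); // prints "Shirts Pants Jeans"
    }
}
```

The trade-off is that a space is inserted at every tag, including inline ones, so markup inside a word (for example `<b>H</b>ello`) would be split into two tokens. This version also discards line breaks, unlike the function in the question.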

Text extraction with Jsoup and word count

0 answers: