Question

I want to count the words in an HTML page. Using a HashMap, I want to print each word from the page together with the number of times it occurs.

Java code

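import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class CountWords {

    public void readFile() {

        Scanner scanner = null;
        try {
            scanner = new Scanner(new File("D:\\Test.html"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            return; // nothing to count if the file cannot be opened
        }
        // Scanner splits on whitespace only, so the tokens still contain HTML markup
        Map<String, Integer> map = new HashMap<String, Integer>();
        while (scanner.hasNext()) {
            String word = scanner.next();
            if (map.containsKey(word)) {
                map.put(word, map.get(word) + 1);
            } else {
                map.put(word, 1);
            }
        }
        scanner.close();

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());

        // Print the entries in reverse list order (a HashMap has no defined order anyway)
        for (int i = 0; i < map.size(); i++) {
            System.out.println(entries.get(entries.size() - i - 1).getKey()
                    + " " + entries.get(entries.size() - i - 1).getValue());
        }
    }

}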

The output I get is the raw data, including the HTML code. I want to print only the text inside the page, without any of the HTML code.

Answer 1

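You could try the OWASP HTML sanitizer library: https://owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project. I have used it before to sanitize user-submitted posts, but it should do what you want here as well. Because it is a library for allowing or disallowing specific tags in HTML, you can tell it to reject all HTML tags and extract only the content inside them.

Your code would look something like PolicyFactory policy = new HtmlPolicyBuilder().toFactory(); String safeHTML = policy.sanitize(htmlContent);

I have found this to be less error-prone than trying any kind of regular expression.

You will probably need guava.jar and owasp-java-html-sanitizer.jar from http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/distrib/lib/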

Answer 2

You should strip the HTML tags. Here is an example: Remove HTML tags from a String

By the way, why is your output loop so complicated?

for (Map.Entry<String, Integer> entry : map.entrySet()) {
    System.out.printf("%s %d\n", entry.getKey(), entry.getValue());
}
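
Putting the two points of this answer together, here is a minimal sketch that strips the tags first and then counts words. It assumes a naive regular expression is good enough for the page in question; as the other answer notes, regex-based stripping is error-prone (it does not handle script or style contents, comments, or HTML entities). The class name HtmlWordCount and the helper methods stripTags and countWords are made up for illustration and are not from the question.

```java
import java.util.HashMap;
import java.util.Map;

public class HtmlWordCount {

    // Naive tag removal: replaces anything that looks like <...> with a space
    // so that words from neighbouring elements do not get glued together.
    // Assumption: the input contains no scripts, comments, or '>' inside attributes.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Counts whitespace-separated tokens in the given text.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>hello world</p><p>hello</p></body></html>";
        Map<String, Integer> counts = countWords(stripTags(html));
        // HashMap iteration order is unspecified, so the lines may print in any order.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.printf("%s %d%n", entry.getKey(), entry.getValue());
        }
    }
}
```

To use this with the file from the question, read the file contents into a String first and pass that to stripTags. Punctuation stays attached to words here ("hello," and "hello" count separately), the same as with Scanner.next() in the original code.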

Counting words using a HashMap
