Question

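I need to find the number of words in a string. However, this is not an ordinary string: it contains many special character sequences such as &lt;, /em, /p, and so on. As a result, most of the approaches suggested on StackOverflow do not work, so I need to define a regular expression myself.

My plan is to use a regular expression to define what a word is and then count how many times a word occurs. This is how I define a word: it must start with a letter and end with one of the following: : or , or ! or ? or ' or - or ) or . or "

This is how I defined the regular expression: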

pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
matcher = pattern.matcher(line);
while (matcher.find()) 
wordCount++;

However, I get an error on the first line:

pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");

How can I fix this?

Answer 1

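Does this help?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "so.this:is,what)you!wanted?";
int wordCount = 0;
// One or more letters followed by exactly one terminator character.
// The hyphen is placed last in the class so it is a literal; written as '-, it would form a range.
Pattern pattern = Pattern.compile("[a-zA-Z]++[:',.!?\")-]");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    wordCount++;
}
System.out.println(wordCount); // Prints 6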
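
As a possible explanation of why the pattern in the question is rejected, here is a small sketch. It assumes two causes: the unescaped " inside the literal ends the Java string early, which is a compile error, and even with that quote escaped the bare ? right after | has nothing to quantify, so Pattern.compile throws a PatternSyntaxException. The class name PatternCheck and the helper compiles are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternCheck {
    // Hypothetical helper: reports whether Pattern.compile accepts the regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The question's pattern with only the quote escaped so Java can parse the literal.
        String original = "^[a-zA-Z](:|,|!|?|'|-|)|.|\")$";
        // The same terminators as a character class; the hyphen goes last so it stays literal.
        String fixed = "[a-zA-Z]+[:,!?')\".-]";
        System.out.println(compiles(original));
        System.out.println(compiles(fixed));
    }
}
```

Inside a character class, ? . and ) need no escaping, which is why the class form is easier to get right than a chain of alternatives. Note also that the ^ and $ anchors in the question's pattern would restrict a match to the whole line, so they are dropped here.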

Answer 2

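In fact, you will also want to remove tags such as <em> (HTML emphasis), which would otherwise be counted as words. If you also take into account full tags with attributes, such as <span font="Consolas">, then stripping the tags is the easier approach: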

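public static int wordCount(String s) {
    String cleaned = s.replaceAll("<[A-Za-z/][^>]*>", " ") // Replace tags with a space
        .replaceAll("[^\\p{L}\\p{M}\\d]+", " ") // Runs of anything but letters, combining marks, and digits become one space
        .trim(); // No leading or trailing space, so no empty words at the ends
    // "".split(" ") yields one empty element, so the no-words case is handled explicitly
    return cleaned.isEmpty() ? 0 : cleaned.split(" ").length;
}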

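This is quite inefficient because of the repeated replaceAll calls and the trim. Precompiling the regular expressions with Pattern would at least be better, though it may not be worth the effort.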
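
A minimal sketch of what the precompiled variant might look like, using the same two expressions as above. The class name WordCounter and the constant names TAGS and NON_WORD are assumptions made for this example, not part of the original answer.

```java
import java.util.regex.Pattern;

class WordCounter {
    // Compiled once and reused across calls; same expressions as in the answer.
    private static final Pattern TAGS = Pattern.compile("<[A-Za-z/][^>]*>");
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{M}\\d]+");

    static int wordCount(String s) {
        // Replace each tag with a space so adjacent words do not merge.
        String noTags = TAGS.matcher(s).replaceAll(" ");
        // Collapse every run of non-word characters into a single space.
        String cleaned = NON_WORD.matcher(noTags).replaceAll(" ").trim();
        if (cleaned.isEmpty()) {
            return 0; // "".split(" ") would report one (empty) word
        }
        return cleaned.split(" ").length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("<p>Hello, <em>big</em> world!</p>")); // prints 3
    }
}
```

Under this sketch an escaped entity such as &lt; still contributes the word lt, so input that contains entities would need to be unescaped or filtered first if those should not count.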

Finding the number of words in a string that has many special characters