Question

I have the following sentence:

String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.";

How can I retrieve all the words in the sentence? I expect the following output:

And
God
said
Let
there
be
light
and
there
was
light

Answer 1

First, strip any leading or trailing whitespace:

.trim()

Then get rid of the HTML entities (&...;):

.replaceAll("&.*?;", "")

& and ; are literal characters in a regex, and .*? is the non-greedy version of "any character, any number of times".
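To see why the non-greedy quantifier matters here, the following minimal sketch compares the pattern above with a greedy variant. The class and method names (EntityStripDemo, stripLazy, stripGreedy) are made up for illustration.

```java
public class EntityStripDemo {
    // Non-greedy pattern from the answer: each match stops at the first ';'.
    static String stripLazy(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Greedy variant, for comparison only: one match runs from the first '&'
    // to the last ';' and swallows the text in between.
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    public static void main(String[] args) {
        String s = "&#8220;Let there be light,&#8221; and";
        System.out.println("[" + stripLazy(s) + "]");   // [Let there be light, and]
        System.out.println("[" + stripGreedy(s) + "]"); // [ and]
    }
}
```

With the greedy form, the quoted words themselves are lost, which is why the answer uses .*? instead of .*.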

Next, get rid of the tags and their contents:

.replaceAll("<(.*?)>.*?</\\1>", "")

< and > are again taken literally, .*? is explained above, (...) defines a capturing group, and \\1 references that group.
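A small sketch of how the back-reference behaves, assuming simple tags without attributes. TagStripDemo and stripTags are hypothetical names used only for this example.

```java
public class TagStripDemo {
    // Removes <name>...</name> pairs; \\1 requires the closing tag to repeat
    // exactly what the capturing group matched in the opening tag.
    static String stripTags(String s) {
        return s.replaceAll("<(.*?)>.*?</\\1>", "");
    }

    public static void main(String[] args) {
        // Opening and closing names agree, so the tag and its content go away.
        System.out.println(stripTags("said, <sup>c</sup>Let")); // said, Let

        // Mismatched names: the back-reference fails and nothing is removed.
        System.out.println(stripTags("a <b>bold</i> b")); // a <b>bold</i> b

        // The group captures everything between < and >, attributes included,
        // so a tag with attributes is not matched by this pattern either.
        System.out.println(stripTags("<a href='x'>link</a>")); // <a href='x'>link</a>
    }
}
```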

Finally, split on any sequence of non-letters:

.split("[^a-zA-Z]+")

[a-zA-Z] means all characters from a to z and from A to Z, ^ inverts the match, and + means "one or more times".

So all together it would be:

String[] words = str.trim().replaceAll("&.*?;", "").replaceAll("<(.*?)>.*?</\\1>", "").split("[^a-zA-Z]+");
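The chain above can be wrapped into a runnable sketch like the one below. Note that split returns a String[], so the result has to be stored in an array. WordExtractor and extractWords are names chosen for this example only.

```java
import java.util.Arrays;

public class WordExtractor {
    // Applies the four steps in order: trim, drop entities, drop tags, split.
    static String[] extractWords(String str) {
        return str.trim()
                .replaceAll("&.*?;", "")            // HTML entities such as &#8220;
                .replaceAll("<(.*?)>.*?</\\1>", "") // <tag>content</tag> pairs
                .split("[^a-zA-Z]+");               // any run of non-letters
    }

    public static void main(String[] args) {
        String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.";
        System.out.println(Arrays.toString(extractWords(str)));
        // [And, God, said, Let, there, be, light, and, there, was, light]
    }
}
```

The trim() call matters because split would otherwise produce an empty first element for a string that starts with a non-letter; trailing empty elements are dropped by split on its own.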

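Note that this won't handle self-closing tags such as <img src="a.png" />. Also note that if you need full HTML parsing, you should consider letting a real engine parse it; parsing HTML with regex is a bad idea.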
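If self-closing tags are the only extra case you need, one possible workaround is an additional replacement step. This is a rough sketch and not part of the original answer: the pattern <[^>]*/> and the names SelfClosingDemo and stripSelfClosing are assumptions made here, and the pattern breaks if an attribute value contains a > character.

```java
public class SelfClosingDemo {
    // Removes tags of the form <... /> by matching up to the first '>'
    // and requiring a '/' directly before it.
    static String stripSelfClosing(String s) {
        return s.replaceAll("<[^>]*/>", "");
    }

    public static void main(String[] args) {
        // The tag is removed; the spaces on either side of it remain.
        System.out.println("[" + stripSelfClosing("light <img src=\"a.png\" /> and") + "]");

        // Ordinary paired tags are left alone by this step.
        System.out.println(stripSelfClosing("<sup>c</sup>")); // <sup>c</sup>
    }
}
```

Anything beyond such simple cases is where a real HTML parser becomes the safer choice.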

Answer 2

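You can use String.replaceAll(regex, replacement) with the regex [^A-Za-z]+ to keep only the letters. On its own, that would also keep the letters of the sup tag and the c inside it, which is why the first statement removes the tag and everything between it.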

    String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.".replaceAll("<sup>[^<]</sup>", "");
    String newstr = str.replaceAll("[^A-Za-z]+", " ");

The <sup> </sup> part is removed from the sentence separately.
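The two statements leave a single space-separated string rather than a list of words. A minimal sketch of one way to finish the job is shown below; the trim and split steps are additions made here, the class and method names are hypothetical, and the sup pattern uses [^<]* instead of [^<] so that content longer than one character is also removed.

```java
import java.util.Arrays;

public class ReplaceThenSplit {
    static String[] words(String input) {
        // Drop <sup>...</sup> together with its content (any length).
        String noSup = input.replaceAll("<sup>[^<]*</sup>", "");
        // Collapse every run of non-letters into a single space.
        String lettersOnly = noSup.replaceAll("[^A-Za-z]+", " ");
        // Remove the leading and trailing space, then split on single spaces.
        return lettersOnly.trim().split(" ");
    }

    public static void main(String[] args) {
        String str = " And God said, <sup>c</sup>&#8220;Let there be light,&#8221; and there was light.";
        System.out.println(Arrays.toString(words(str)));
        // [And, God, said, Let, there, be, light, and, there, was, light]
    }
}
```

The entity &#8220; needs no special handling in this approach because it contains no letters, so it disappears in the second replacement.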
