Question

I am trying to split a sentence using a regular expression.

Sentence:

"Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". Is it good? I like it! 'He is right' said I."

Current regex:

\\s+|(?<=[\\p{Punct}&&[^']])|(?=[\\p{Punct}&&[^']])

Current result:

{"Hallo", ",", "I'm", "a", "dog", ".", "The", "end", ".", "Someone",
"said", ":", **""**, """ , "Earth", "is", "Earth", """, ".", "Is", "it",
"good", "?", "I", "like", "it", "!", **"'He"**, "is", **"right'"**,
"said", "I", "."}

I get an extra "" before the first quotation mark, and it does not separate ' from the words.

Desired result:

{"Hallo", ",", "I'm", "a", "dog", ".", "The", "end", ".", "Someone",
"said", ":", """ , "Earth", "is", "Earth", """, ".", "Is", "it",
"good", "?", "I", "like", "it", "!", "'" , "He", "is", "right", "'",
"said", "I", "."}

Edit: Sorry! More code:

String toTest =  "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". Is it good? I like it! 'He is right' said I.";
String [] words = toTest.split("\\s+|(?<=[\\p{Punct}&&[^']])|(?=[\\p{Punct}&&[^']])");

which produces the word list:

words = {"Hallo", ",", "I'm", "a", "dog", ".", "The", "end", ".", "Someone", "said", ":", "", """, "Earth", "is", "Earth", """, ".", "Is", "it", "good", "?", "I", "like", "it", "!", "'He", "is", "right'", "said", "I", "."}

Answer 1

You could try:

\\s+|(?<=[\\p{Punct}&&[^']])(?!\\s)|(?=[\\p{Punct}&&[^']])(?<!\\s)|(?<=[\\s\\p{Punct}]['])(?!\\s)|(?=['][\\s\\p{Punct}])(?<!\\s)

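The problem with said: \"Earth is that you split both on the space and again at the zero-width position right after it (just before the quote), which leaves an empty string between the two split points. So I added a negative lookahead and a negative lookbehind to the punctuation splits.

I also added two cases that split off a single quote when it is preceded or followed by whitespace or some punctuation.

However, as @RealSkeptic wrote in his comment, this will not take care of single quotes that denote possession, as in "the dolphins' nose".

You may need to write a real parser for that.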
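
To check the pattern above against the sample sentence, here is a minimal sketch. The class name QuoteAwareSplitDemo and the method tokenize are made up for illustration; the pattern itself is copied unchanged from this answer and only broken across lines so that each alternative can carry a comment.

```java
import java.util.Arrays;

// Hypothetical demo class; only the regex comes from the answer.
class QuoteAwareSplitDemo {
    static final String PATTERN =
            // 1. runs of whitespace
            "\\s+"
            // 2. right after punctuation (except '), unless whitespace follows
            + "|(?<=[\\p{Punct}&&[^']])(?!\\s)"
            // 3. right before punctuation (except '), unless whitespace precedes
            + "|(?=[\\p{Punct}&&[^']])(?<!\\s)"
            // 4. right after a ' that itself follows whitespace or punctuation
            + "|(?<=[\\s\\p{Punct}]['])(?!\\s)"
            // 5. right before a ' that is followed by whitespace or punctuation
            + "|(?=['][\\s\\p{Punct}])(?<!\\s)";

    static String[] tokenize(String sentence) {
        return sentence.split(PATTERN);
    }

    public static void main(String[] args) {
        String toTest = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". "
                + "Is it good? I like it! 'He is right' said I.";
        System.out.println(Arrays.toString(tokenize(toTest)));

        // Possessive apostrophe: treated like a closing quote and split off.
        System.out.println(Arrays.toString(tokenize("the dolphins' nose")));
    }
}
```

For the sample sentence this should give the token list the question asks for. The second call illustrates the limitation mentioned above: the apostrophe in dolphins' is followed by a space, so it is split off as if it were a closing quote.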

Answer 2

You could try separating the special characters from the words:

yoursentence.replaceAll("([^\\w ])", " $1 ").split(" +");

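This messes up the spacing, but I guess you don't need to care how many spaces sit next to each other in your sentence. Also, it is "a bit" simpler than yours :D

Copy-pasteable code to try: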

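import java.util.Arrays;
import java.util.List;

public class SplitDemo { // class name is arbitrary
    public static void main(String[] args) {
        String s = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". Is it good? I like it! 'He is right' said I.";
        // Surround every character that is neither a word character nor a space with spaces
        String replaceAll = s.replaceAll("([^\\w ])", " $1 ");
        // Split on runs of one or more spaces
        List<String> asList = Arrays.asList(replaceAll.split(" +"));

        for (String string : asList) {
            System.out.println(string);
        }
    }
}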
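
One thing to be aware of with this replacement: [^\\w ] also matches the apostrophe inside I'm, so that word comes out as I, ', m. A possible variant is sketched here, under the assumption that an apostrophe with word characters on both sides belongs to the word, so only apostrophes at a word edge get padded. The class and method names are hypothetical, and it shares the possessive limitation (dolphins') mentioned in Answer 1.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical variant of the pad-and-split idea; not part of the original answer.
class PadAndSplitDemo {
    static List<String> tokenize(String sentence) {
        String padded = sentence.replaceAll(
                // an apostrophe not preceded by a word character,
                // or an apostrophe not followed by a word character,
                // or any other character that is not a word character, whitespace, or apostrophe
                "(?<!\\w)'|'(?!\\w)|[^\\w\\s']",
                " $0 ");
        // trim() avoids an empty first token when the sentence starts with punctuation
        return Arrays.asList(padded.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String s = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". "
                + "Is it good? I like it! 'He is right' said I.";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```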

Answer 3

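Although the problem can be solved with a single regex, my approach would be to split the work into several steps, each of which does one thing.

So I suggest you create an interface: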

import java.util.List;

public interface IProcess {
    List<String> process (List<String> input);
}

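Now you can start with a list that contains the whole sentence as its first element, and a first implementation that returns the words split on whitespace: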

    return Arrays.asList (input.get (0).split ("\\s+") );

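The next step is to write a processor for each kind of special character and chain them together. For example, you could take .,!? off the end of each word to clean up the input for the next step.

This way you can easily write a unit test for each processor whenever you find a bug, and easily narrow down which part of the chain needs improvement.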
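
A minimal sketch of such a chain, assuming the IProcess interface above. The classes WhitespaceSplitter, EdgeCharSplitter and ProcessChainDemo are hypothetical names, and splitting the special characters off the word edges as separate tokens (rather than discarding them) is an assumption chosen to match the result the question asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

interface IProcess {
    List<String> process(List<String> input);
}

// Step 1: split the whole sentence (the first list element) on whitespace.
class WhitespaceSplitter implements IProcess {
    public List<String> process(List<String> input) {
        return Arrays.asList(input.get(0).split("\\s+"));
    }
}

// Hypothetical step: split the given characters off the start and end of each
// word as separate tokens. Characters inside a word (the ' in I'm) are untouched.
class EdgeCharSplitter implements IProcess {
    private final String chars;

    EdgeCharSplitter(String chars) {
        this.chars = chars;
    }

    public List<String> process(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String word : input) {
            int start = 0;
            int end = word.length();
            // Emit leading special characters one by one.
            while (start < end && chars.indexOf(word.charAt(start)) >= 0) {
                out.add(String.valueOf(word.charAt(start)));
                start++;
            }
            // Find where the trailing run of special characters begins.
            int coreEnd = end;
            while (coreEnd > start && chars.indexOf(word.charAt(coreEnd - 1)) >= 0) {
                coreEnd--;
            }
            if (coreEnd > start) {
                out.add(word.substring(start, coreEnd));
            }
            // Emit trailing special characters in their original order.
            for (int i = coreEnd; i < end; i++) {
                out.add(String.valueOf(word.charAt(i)));
            }
        }
        return out;
    }
}

class ProcessChainDemo {
    static List<String> tokenize(String sentence) {
        List<IProcess> chain = Arrays.asList(
                new WhitespaceSplitter(),
                new EdgeCharSplitter(".,!?:"),  // sentence punctuation first
                new EdgeCharSplitter("\"'"));   // then quotes
        List<String> current = Collections.singletonList(sentence);
        for (IProcess step : chain) {
            current = step.process(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String s = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". "
                + "Is it good? I like it! 'He is right' said I.";
        System.out.println(tokenize(s));
    }
}
```

Note that the order of the steps matters in this sketch: the punctuation step runs before the quote step, so a word like Earth". is handled, while a period inside the closing quote would stay attached to the word. Each step can be unit-tested on its own with a small input list.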

Java: splitting a sentence into words, punctuation, and quotes
