Question

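I want to find and separate the words in a title that has no spaces.

Before:

ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]"Test"'Test'

After:

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'

I'm looking for a regular expression rule that can do the following.

I thought I could identify each word by checking whether it starts with an uppercase letter.

But all-uppercase words should also be preserved, so they are not spaced out into A L L U P P E R C A S E.

Additional rules:

Insert a space where a letter touches a number: Hello2019World -> Hello 2019 World
Do not space initials that contain periods, hyphens, or underscores: T.E.S.T.
Do not add spaces inside brackets, parentheses, or quotes: [Test] (Test) "Test" 'Test'
Preserve hyphens: Hello-World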

C#

https://rextester.com/GAZJS38767

// Title without spaces
string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";

// Detect where to space words
string[] split =  Regex.Split(title, "(?<!^)(?=(?<![.\\-'\"([{])[A-Z][\\d+]?)");

// Trim each word of extra spaces before joining
split = (from e in split
         select e.Trim()).ToArray();

// Join into new title
string newtitle = string.Join(" ", split);

// Display
Console.WriteLine(newtitle);

Regex

I'm having trouble with the spacing before numbers, brackets, parentheses, and quotes.

https://regex101.com/r/9IIYGX/1

(?<!^)(?=(?<![.\-'"([{])(?<![A-Z])[A-Z][\d+?]?)

(?<!^)          // Negative look behind

(?=             // Positive look ahead

(?<![.\-'"([{]) // Ignore if starts with punctuation
(?<![A-Z])      // Ignore if starts with double Uppercase letter
[A-Z]           // Space after each Uppercase letter
[\d+]?          // Space after number

)

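Solution

Thanks for all the combined effort in the answers. Here is a regex example. I'm applying this to file names and have excluded the special characters \/:*?"<>|.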

https://rextester.com/FYEVE73725

https://regex101.com/r/xi8L4z/1

Answer 1

Here is a regex that seems to work well, at least for your sample input:

(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)

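This pattern splits on any boundary that meets one of the following conditions:

what precedes is a lowercase letter and what follows is an uppercase letter
what precedes is a digit and what follows is a letter (or vice versa)
what precedes and what follows are both non-word characters (e.g. quotes, parentheses, etc.)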

string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
string[] split =  Regex.Split(title, "(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\\W)(?=\\W)"); 
split = (from e in split select e.Trim()).ToArray();
string newtitle = string.Join(" ", split);

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'

Note: you might also want to add these assertions to the regex alternation:

(?<=\W)(?=\w)|(?<=\w)(?=\W)

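We got away without it here, because splitting on this kind of boundary was never needed for the sample input. But you might need it for other inputs.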
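To make the effect of that optional alternation concrete, here is a minimal sketch that wraps the split from this answer in a helper. The class name TitleSplitter and the includeExtra switch are hypothetical names introduced only for illustration; the two patterns are copied from the answer as written.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TitleSplitter
{
    // Boundary pattern from the answer above.
    public const string BasePattern =
        @"(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)";

    // Optional word/non-word boundaries from the note.
    public const string ExtraPattern = @"(?<=\W)(?=\w)|(?<=\w)(?=\W)";

    // includeExtra is a hypothetical switch, not part of the original answer.
    public static string Split(string title, bool includeExtra = false)
    {
        string pattern = includeExtra ? BasePattern + "|" + ExtraPattern : BasePattern;

        // Split on the zero-width boundaries, trim each piece, drop empty pieces.
        var parts = Regex.Split(title, pattern)
                         .Select(p => p.Trim())
                         .Where(p => p.Length > 0);

        return string.Join(" ", parts);
    }
}
```

With the base pattern, TitleSplitter.Split("Hello(World)") leaves the input unchanged, because no alternative covers a letter touching a bracket. With includeExtra set to true the same input becomes "Hello ( World )", so the extra alternation also separates brackets from the word they enclose. Whether that is desirable depends on the input.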

Answer 2

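To keep things simple instead of writing one huge regex, I would recommend this code, which uses small, simple patterns (explanations are in the code comments):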

string str = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)\"Test\"'Test'[Test]";
// insert a space when a lowercase letter is followed by an uppercase letter
str = Regex.Replace(str, "(?<=[a-z])(?=[A-Z])", " ");
// insert a space whenever a digit is followed by a letter
str = Regex.Replace(str, @"(?<=\d)(?=[A-Za-z])", " ");
// insert a space when a letter is followed by a digit
str = Regex.Replace(str, @"(?<=[A-Za-z])(?=\d)", " ");
// insert a space before one of the characters ("'[ when it is followed by a letter or digit
str = Regex.Replace(str, @"(?=[(\[""'][a-zA-Z0-9])", " ");
// insert a space when what precedes is one of the characters ])"'
str = Regex.Replace(str, @"(?<=[)\]""'])", " ");

Answer 3

The first few parts are similar to @revo's answer: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}

In addition, I added the following regex to insert a space between numbers and letters: (?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

To detect cases like OTPIsADevice, I use lookahead and lookbehind to find an uppercase letter next to a lowercase one: (((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Note that | is the OR operator, which lets all of these regexes run as a single pattern.

Regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])|(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Demo

Update

Improved a bit:

From: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

Into: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d

Both do the same thing.

The following is adapted from the OP's comment and adds exceptions for punctuation: (((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])

Final regex: (((?<!^)(?<!['([{])[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\\]}!&}])

Demo
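This answer does not include C# code, so the following is only a sketch of how the final regex might be applied. It assumes the same Regex.Replace approach with a " $0" replacement that Answer 4 describes. The class name Answer3Sketch, the method name AddSpaces, and the trailing Trim() call are assumptions added for illustration. The pattern is written as a verbatim string, so the escaped bracket appears as \] instead of \\].

```csharp
using System;
using System.Text.RegularExpressions;

public static class Answer3Sketch
{
    // Final regex from this answer, as a verbatim string (\] instead of \\]).
    private static readonly Regex Boundary = new Regex(
        @"(((?<!^)(?<!['([{])[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])");

    // Hypothetical helper: puts a space in front of every match.
    public static string AddSpaces(string title)
    {
        // " $0" emits a space followed by the matched text; for the
        // zero-width alternatives the matched text is empty, so only the space is added.
        // Trim() (an addition, not from the answer) removes a space added at the very end.
        return Boundary.Replace(title, " $0").Trim();
    }
}
```

For example, Answer3Sketch.AddSpaces("ThisIsATest") returns "This Is A Test". Inputs that mix digits, quotes, and brackets are worth checking against the linked demo, since this final pattern no longer contains the digit alternatives from the earlier version.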

Answer 4

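You can reduce the requirements, and so shorten the regular expression, by interpreting them differently. For example, the first requirement is the same as saying: preserve capital letters if they are not preceded by a punctuation mark or another capital letter.

The following regex covers almost all of the requirements above and can be extended to include or exclude other situations: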

(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}

You have to use the Replace() method with " $0" (a space followed by $0) as the replacement string.

See live demo here

.NET (see it in action):

string input = @"ThisIsAnExample.TitleHELLO-WORLD2019T.E.S.T.(Test)""Test""'Test'[Test]";
Regex regex = new Regex(@"(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}", RegexOptions.Multiline);
Console.WriteLine(regex.Replace(input, @" $0"));

Separate a title string with no spaces into words
