Question

How can I separate the different character sets in a string? For example, if I have these character sets:

[a-z]
[A-Z]
[0-9]
[\s]
{everything else}

and this input:

thisISaTEST***1234pie

then I want to separate the different character sets. For example, using a newline as the separator:

this
IS
a
TEST
***
1234
pie

I tried this regex, with a positive lookahead:

'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")

But apparently the + isn't being greedy, because I get:

t
h
# (snip)...
S
T***
1
# (snip)...
e

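I snipped the irrelevant parts, but as you can see, every character counts as its own character set, except for the {everything else} set.

How can I do this? It doesn't have to be a regex. Splitting the string into an array would work too.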

Answer 1

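The difficult part is matching whatever does not match the rest of the regex. Forget about that, and think of a way to get the non-matching parts mixed in with the matching parts.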

"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
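To see why this works where the lookahead did not, here is a minimal sketch. The lookahead in the question is zero-width, so it matches before every single character that belongs to one of the sets. `split` with a capturing group instead consumes whole runs and keeps them in the result, while the text between matches (the "everything else" runs) comes along for free. The variable names `pattern`, `raw` and `pieces` are only illustrative.

```ruby
str = "thisISaTEST***1234pie"

# The zero-width lookahead succeeds before each matching character,
# so a newline is inserted in front of every letter.
lookahead_result = "abAB".gsub(/(?=[a-z]+|[A-Z]+)/, "\n")

# With a capturing group, split returns the delimiters (the runs) as well
# as the text between them. Adjacent runs leave empty strings in between.
pattern = /([a-z]+|[A-Z]+|\d+|\s+)/
raw     = str.split(pattern)
pieces  = raw.reject(&:empty?)

# Join with a newline to get the layout asked for in the question.
puts pieces.join("\n")
```

The `reject(&:empty?)` step is needed because two runs that touch each other have an empty string between them.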

Answer 2

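In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters. They are matched by the property construct \p{punct}.

To split your string into sequences of a single category, you can write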

str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)

output

["this", "IS", "a", "TEST", "***", "1234", "pie"]

Or, if your string contains characters outside the ASCII set, you can write the whole thing in terms of properties, like this

p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
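As a rough illustration of the difference, the sketch below runs an ASCII-range pattern and a property-based pattern over the same string. The sample string (the word "straße" followed by "ÉCOLE" and "42", written with `\u` escapes) is made up for this example and is not from the question.

```ruby
# Hypothetical sample: "straße" + "ÉCOLE" + "42", using escapes so the
# source file stays plain ASCII.
sample = "stra\u00DFe\u00C9COLE42"

# ASCII ranges do not know about U+00DF or U+00C9, so scan skips them
# and the surrounding runs are cut short.
ascii_only = sample.scan(/[a-z]+|[A-Z]+|\d+/)

# Unicode properties treat those characters as lowercase and uppercase.
unicode_aware = sample.scan(/\p{lower}+|\p{upper}+|\p{digit}+/)

p ascii_only
p unicode_aware
```

Note that plain `scan` silently skips characters that no alternative matches, which is why the ASCII version loses two letters here.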

Answer 3

Here are two solutions.

String#scan (with a regular expression)

str = "thisISa\n TEST*$*1234pie"

r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

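Because of the ^ at the beginning of [^a-zA-Z\d\s], that character class matches any character other than letters (lowercase and uppercase), digits and whitespace.

Using Enumerable#slice_when ¹

First, a helper method: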

def type(c)
  case c
  when /[a-z]/ then 0
  when /[A-Z]/ then 1
  when /\d/    then 2
  when /\s/    then 3
  else 4
  end
end

For example,

type "f"   #=> 0
type "P"   #=> 1
type "3"   #=> 2
type "\n"  #=> 3
type "*"   #=> 4

Then

str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

¹ slice_when made its debut in Ruby v2.2.
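If the category of each run is needed as well, a variation on the same idea is to use Enumerable#chunk, which groups consecutive elements by the block's return value and hands back that value with each group. This is only a sketch: the method names `char_class` and `labelled_runs` and the symbol labels are invented for this example.

```ruby
# Classify a single character; the symbols are arbitrary labels.
def char_class(c)
  case c
  when /[a-z]/ then :lower
  when /[A-Z]/ then :upper
  when /\d/    then :digit
  when /\s/    then :space
  else              :other
  end
end

# Group consecutive characters sharing a class and return
# [label, text] pairs in order of appearance.
def labelled_runs(str)
  str.each_char
     .chunk { |c| char_class(c) }
     .map { |label, chars| [label, chars.join] }
end

runs = labelled_runs("thisISaTEST***1234pie")
runs.each { |label, text| puts "#{label}: #{text.inspect}" }
```

Calling `runs.map(&:last)` gives back the plain array of strings when the labels are not needed.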

Answer 4

[^\w\s] covers the non-word, non-whitespace characters, so:

"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
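One caveat worth checking against your own input: \w also matches the underscore, so an underscore belongs to none of the alternatives above and scan drops it. The sketch below compares this pattern with an explicit negated class on a made-up sample string.

```ruby
# Hypothetical input containing an underscore.
sample = "snake_caseID"

# "_" is a word character, so [^\w\s] does not match it and no other
# alternative does either; scan skips it.
with_word_class = sample.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/)

# Negating exactly the listed sets keeps the underscore as its own run.
with_explicit = sample.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/)

p with_word_class
p with_explicit
```

For the string in the question both patterns give the same result, since it contains no underscore.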

Partition/split a string by character sets in Ruby