Question

I'm new to regular expressions. I'm using the regex:

/\w+/

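to match words. Obviously this has problems with punctuation, but I'm not sure how to change the regex. For example, when I run this from a class I created: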

Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display

I get the output:

-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------

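How can I adjust the regex so that it matches words containing apostrophes, e.g. isn't as a single word, but leaves out a leading or trailing apostrophe, so that 'the in the sample matches as just the? Hyphens in the middle of a word like stack-overflow should split the match, returning stack and overflow separately, which it already does.

Also, words should not start or end with digits: test1241 or 436test should become test, but te7st is fine. Plain numbers should not be recognized at all.

Sorry, I know this is a big ask, but I don't know where to start with regular expressions. If you could explain what the expression means, that would be much appreciated.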

Answer 1

str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"

r = /
    [[:alpha:]]            # match a letter
    (?:                    # begin the outer non-capture group
      (?:[[:alpha:]]|\d|') # match a letter, digit or apostrophe in a non-capture group
      *                    # execute the above non-capture group zero or more times
      [[:alpha:]]          # match a letter
    )?                     # close the outer non-capture group and make it optional
    /x                     # free-spacing regex definition mode

str.scan r
  #=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]

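Note that the outer non-capture group has to be optional (the trailing ?) so that a word consisting of a single letter, such as "a", still matches.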
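
The effect of the optional outer group can be seen by comparing it with a variant that drops the ?. This is only a sketch: the names with_optional and without_optional and the sample sentence are made up for illustration and are not part of the answer.

```ruby
# Same pattern as above, written on one line.
with_optional    = /[[:alpha:]](?:(?:[[:alpha:]]|\d|')*[[:alpha:]])?/
# Hypothetical variant: the group is inlined and no longer optional,
# so every match needs at least two letters.
without_optional = /[[:alpha:]](?:[[:alpha:]]|\d|')*[[:alpha:]]/

sample = "I am a cat"

with_matches    = sample.scan(with_optional)
without_matches = sample.scan(without_optional)

p with_matches    #=> ["I", "am", "a", "cat"]
p without_matches #=> ["am", "cat"]
```

Without the ?, the single-letter words "I" and "a" are silently dropped.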

Hmm. Maybe we should add a hyphen to the inner non-capture group, so that hyphenated forms such as n'est-ce stay together:

r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
  #=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]

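I rarely use the word-matching character \w these days, mainly because it matches the underscore as well as letters and digits. Instead I reach for POSIX bracket expressions (search for "POSIX"), which have the added (possibly main) benefit of not being English-centric. For example, the counterpart of \w without the underscore is [[:alnum:]].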
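
A small sketch of the difference between \w and the POSIX bracket expressions in Ruby. The sample strings and variable names are assumptions chosen for illustration.

```ruby
snake  = "foo_bar"
french = "agr\u00e9able" # "agréable", written with an escape to keep the source ASCII

snake_w      = snake.scan(/\w+/)           # \w includes the underscore
snake_alnum  = snake.scan(/[[:alnum:]]+/)  # [[:alnum:]] does not
french_w     = french.scan(/\w+/)          # \w is ASCII-only in Ruby, so it stops at the accented letter
french_alpha = french.scan(/[[:alpha:]]+/) # POSIX brackets are Unicode-aware

p snake_w      #=> ["foo_bar"]
p snake_alnum  #=> ["foo", "bar"]
p french_w     #=> ["agr", "able"]
p french_alpha #=> ["agréable"]
```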

Answer 2

You can cover the basics with:

/[a-z]+(?:'[a-z]+)*/i

To extend it so that words like a2b are allowed, while the digits of 123abc or abc123 are dropped and plain numbers are skipped:

/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

Neither pattern uses any special regex features, only the basics.
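
As a quick check, both patterns can be run through String#scan. This is a sketch: the variable names and the numbers sample (built from the words mentioned in the question) are assumptions, not part of the answer.

```ruby
basic    = /[a-z]+(?:'[a-z]+)*/i
extended = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

sentence = "This sentence isn't 'the best-example, isn't it not?..."
numbers  = "test1241 436test te7st 123"

sentence_words = sentence.scan(basic)
basic_numbers  = numbers.scan(basic)    # the basic pattern splits te7st at the digit
number_words   = numbers.scan(extended) # the extended one keeps inner digits

p sentence_words #=> ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not"]
p basic_numbers  #=> ["test", "test", "te", "st"]
p number_words   #=> ["test", "test", "te7st"]
```

Note that [a-z] with the i flag only covers ASCII letters, so accented words would be split.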

Answer 3

Try scanning the string with the [[:alpha:]] POSIX character class:

s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
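
To get back to the per-word counts shown in the question, the matches can be lowercased and tallied. A minimal sketch, assuming a hypothetical helper named count_words; it is not the asker's Wordify class, whose internals are not shown.

```ruby
# Hypothetical helper: counts case-insensitive occurrences of each word.
# The default pattern is the one from the scan call above.
def count_words(text, pattern = /[[:alpha:]](?:['\w]*[[:alpha:]])?/)
  counts = Hash.new(0)
  # With no capture groups, scan yields each whole match to the block.
  text.scan(pattern) { |word| counts[word.downcase] += 1 }
  counts
end

counts = count_words("This sentence isn't 'the best-example, isn't it not?...")
counts.each { |word, n| puts "#{word}: #{n}" }
```

With the question's sentence, isn't is now counted twice as a single word instead of being split into isn and t.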

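[First attempt]

I'd split the string into tokens separated by whitespace or hyphens, then clean up each token according to your rules, since it looks like the rules may change as you refine the question: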

def tokenize(str)
  tokens = str.split(/(?:\s+|-)/)
  tokens.reduce([]) do |memo, token|
    token.gsub!(/(^\W+|\W+$)/, '')    # Strip enclosing non-words
    token.gsub!(/(^\d+|\d+$)/, '')    # Strip enclosing digits
    memo + (token=='' ? [] : [token]) # Ignore the empty string
  end
end

s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
#   ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]

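Obviously this solution doesn't rely on regular expressions alone, but for my money it is easier to understand and modify than what I imagine one big regex would look like!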
