Regex for finding words

Asked: 2017-01-24 18:39:16

Tags: ruby regex

I'm very new to regular expressions. I am using the regex:

/\w+/

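to find words. Obviously this has problems with punctuation, but I'm not sure how to change the regex. For example, when I run this from a class I created: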

Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display

I get the output:

-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------

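How can I adjust the regex so that it matches words containing apostrophes, such as isn't, as a single word, but matches only the word itself when the word is merely quoted, as in 'the? Hyphens in the middle of a word such as stack-overflow should split it, returning stack and overflow separately, which already happens.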

Also, words should not begin or end with digits: test1241 and 436test should both become test, but te7st is fine. Plain numbers should not be recognized at all.
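The requirements above can be written down as input/expected pairs, which makes it easy to check any candidate pattern. This is only a sketch: the names `EXPECTED` and `failures` are made up for illustration, and the `'the` entry is inferred from the example sentence rather than stated outright.

```ruby
# Input => words that should come back, taken from the requirements above.
EXPECTED = {
  "isn't"          => ["isn't"],
  "'the"           => ["the"],               # inferred from the example sentence
  "stack-overflow" => ["stack", "overflow"],
  "test1241"       => ["test"],
  "436test"        => ["test"],
  "te7st"          => ["te7st"],
  "555"            => []
}

# Returns the inputs for which `pattern` does not produce the expected words.
def failures(pattern)
  EXPECTED.reject { |input, want| input.scan(pattern) == want }.keys
end

p failures(/\w+/)  # => ["isn't", "test1241", "436test", "555"]
```

An empty array from `failures` would mean the pattern satisfies every listed case.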

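Sorry, I know this is a big ask, but I don't know where to start with regular expressions. If you could explain what the expressions mean, I would greatly appreciate it.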

3 Answers:

Answer 0 (score: 2)

str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"

r = /
    [[:alpha:]]            # match a letter
    (?:                    # begin the outer non-capture group
      (?:[[:alpha:]]|\d|') # match a letter, digit or apostrophe in a non-capture group
      *                    # execute the above non-capture group zero or more times
      [[:alpha:]]          # match a letter
    )?                     # close the outer non-capture group and make it optional
    /x                     # free-spacing regex definition mode

str.scan r
  #=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]

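Note that the outer non-capture group has to be optional so that a word consisting of a single letter (such as "a") still matches.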
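To see what the optional outer group buys you, here is a small comparison. The constant names and the sample text are made up for this sketch; the first pattern is the one above written on a single line, the second is the same pattern with the trailing `?` removed.

```ruby
WITH_OPTIONAL    = /[[:alpha:]](?:(?:[[:alpha:]]|\d|')*[[:alpha:]])?/
WITHOUT_OPTIONAL = /[[:alpha:]](?:[[:alpha:]]|\d|')*[[:alpha:]]/

TEXT = "a b2 c'd I am"

p TEXT.scan(WITH_OPTIONAL)     # => ["a", "b", "c'd", "I", "am"]
p TEXT.scan(WITHOUT_OPTIONAL)  # => ["c'd", "am"]
```

Without the `?`, every match needs at least two letters (a first and a last), so "a", "I" and the "b" of "b2" are silently dropped.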

Hmm. Perhaps we should add a hyphen to the inner non-capture group.

r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
  #=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]

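I seldom use the word-matching character \w these days, mainly because it matches underscores as well as letters and digits. Instead I reach for POSIX bracket expressions (search for "POSIX"), which have the added (and perhaps main) benefit of not being English-centric. For example, the bracket expression that matches word characters other than the underscore is [[:alnum:]].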
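A quick way to see the difference between \w and the POSIX bracket expressions is to run them over the same text. The sample string below is made up for this sketch (`\u00E9` is the letter é), and the results assume a UTF-8 string, which is Ruby's default for string literals.

```ruby
SAMPLE = "agr\u00E9able snake_case r2d2"

p SAMPLE.scan(/\w+/)           # => ["agr", "able", "snake_case", "r2d2"]
p SAMPLE.scan(/[[:alpha:]]+/)  # => ["agréable", "snake", "case", "r", "d"]
p SAMPLE.scan(/[[:alnum:]]+/)  # => ["agréable", "snake", "case", "r2d2"]
```

In Ruby, \w covers only the ASCII letters, digits and underscore, so it splits "agréable" at the é while keeping "snake_case" in one piece; the bracket expressions behave the other way around.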

Answer 1 (score: 1)

You can do the basic job with:

/[a-z]+(?:'[a-z]+)*/i

To extend it so that it allows words like a2b, while still leaving out the digits in 123abc and abc123 and skipping plain numbers:

/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

Neither pattern uses any special regex features, only the basics.
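Here is a sketch of how the two patterns behave on the cases from the question. The constant names and the sample text are made up for illustration.

```ruby
BASIC    = /[a-z]+(?:'[a-z]+)*/i
EXTENDED = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

TEXT = "isn't 'the te7st test1241 436test 555"

p TEXT.scan(BASIC)     # => ["isn't", "the", "te", "st", "test", "test"]
p TEXT.scan(EXTENDED)  # => ["isn't", "the", "te7st", "test", "test"]
```

The only difference is te7st: the extra alternative `\d+[a-z]+` lets a run of digits continue a word, but only when letters follow it, which is why trailing digits and plain numbers never match. Note that `[a-z]` with the `i` flag covers ASCII letters only.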

Answer 2 (score: 1)

Try scanning the string using the [[:alpha:]] POSIX character class:

s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
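To get back to the word counts in the question, the scan can feed a hash. This is only a sketch: the Wordify class is not shown in the question, so the method name `word_counts` and the downcasing step are assumptions based on its output.

```ruby
WORD = /[[:alpha:]](?:['\w]*[[:alpha:]])?/

# Counts each word, case-insensitively, in order of first appearance.
def word_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(WORD) { |word| counts[word] += 1 }
  counts
end

word_counts("This sentence isn't 'the best-example, isn't it not?...").each do |word, n|
  puts "#{word}: #{n}"
end
```

With this pattern isn't is counted twice as one word, instead of isn and t being counted separately.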

[First attempt]

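I split the string into tokens separated by whitespace or hyphens, and then clean up each token according to your rules, since it looks like those rules may change as you refine your question: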

def tokenize(str)
  tokens = str.split(/(?:\s+|-)/)
  tokens.reduce([]) do |memo, token|
    token.gsub!(/(^\W+|\W+$)/, '')    # Strip enclosing non-words
    token.gsub!(/(^\d+|\d+$)/, '')    # Strip enclosing digits
    memo + (token=='' ? [] : [token]) # Ignore the empty string
  end
end

s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
#   ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]

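Obviously this solution doesn't rely on regexes alone, but for my money it is much easier to understand and modify than what (I imagine) one big regex would look like!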
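If you would rather not modify the tokens in place, the same steps can be chained. This is a hypothetical variant, not a replacement for the version above: the name `tokenize_words` is made up, and it uses `\A` and `\z` (start and end of string) where the original uses `^` and `$`, which makes no difference for single-line tokens.

```ruby
# Hypothetical non-mutating variant of the tokenizer above.
def tokenize_words(str)
  str.split(/\s+|-/)                                   # break on whitespace or hyphens
     .map { |t| t.sub(/\A\W+/, '').sub(/\W+\z/, '') }  # strip enclosing non-word characters
     .map { |t| t.sub(/\A\d+/, '').sub(/\d+\z/, '') }  # strip enclosing digits
     .reject(&:empty?)                                 # drop tokens that ended up empty
end

s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
p tokenize_words(s)
```

Each rule sits on its own line, so adding or dropping one is a one-line change.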