Ruby string scan returns different results for different strings

Time: 2017-08-07 15:43:21

Tags: ruby string tokenize

irb(main):161:0>  "Ready for your my next session?".scan(/[A-Za-z]+|\d+|. /)
=> ["Ready", "for", "your", "my", "next", "session"]
=> ["Ready", "for", "your", "my", "next", "session", "?"] #==> EXPECTED
irb(main):162:0> "yo mr. menon how are you? call at 9 a.m. \"okay\"".scan(/[A-Za-z]+|\d+|. /)
=> ["yo", "mr", ". ", "menon", "how", "are", "you", "? ", "call", "at", "9", "a", "m", ". ", "okay"]
=> ["yo", "mr", ". ", "menon", "how", "are", "you", "? ", "call", "at", "9", "a",".", "m", ".", "``", "okay", "''"] #==> EXPECTED

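I am trying to use scan(/[A-Za-z]+|\d+|. /) to tokenize a string, including its punctuation, even when the string contains escaped quotes (\").

But it behaves differently depending on the structure of the string. How can I fix this?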

1 answer:

Answer 0: (score: 1)

r = /
    (?:          # begin a non-capture group
      \"?        # optionally (?) match a double-quote
      \p{alpha}+ # match one or more letters
      \"?        # optionally (?) match a double-quote
    )            # end non-capture group
    |            # or
    \d+          # match one or more digits
    |            # or
    [.,?!:;]     # match a punctuation mark
    /x           # free-spacing regex definition mode

"yo mr. menon how are you? call at 9 a.m. \"okay\"".scan(r)
   #=> ["yo", "mr", ".", "menon", "how", "are", "you", "?", "call", "at", "9",
   #    "a", ".", "m", ".", "\"okay\""]
puts "\"okay\""
   # "okay"
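
Note that the scan above keeps each double quote attached to its word, whereas the expected output in the question lists the quotes as separate tokens. If that is what is wanted, a minimal sketch follows. It is not part of the answer's regex: the split_quotes pattern and the conversion to `` and '' are my own assumptions, and the conversion simply assumes that double quotes alternate between opening and closing.

```ruby
# Hypothetical variant (an assumption, not the answer's regex): letters,
# digits, or any single punctuation character, so quotes become
# tokens of their own.
split_quotes = /\p{Alpha}+|\d+|[[:punct:]]/

tokens = "call at 9 a.m. \"okay\"".scan(split_quotes)
p tokens
# => ["call", "at", "9", "a", ".", "m", ".", "\"", "okay", "\""]

# Optional step: rewrite double quotes as `` and '' like the expected
# output in the question. Assumes quotes strictly alternate open/close.
opening = true
treebank = tokens.map do |t|
  next t unless t == '"'
  mark = opening ? "``" : "''"
  opening = !opening
  mark
end
p treebank
# => ["call", "at", "9", "a", ".", "m", ".", "``", "okay", "''"]
```

Unbalanced or nested quotes would need more careful handling than this alternation.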

In conventional (non-free-spacing) form, the regular expression is written:

/(?:\"?\p{alpha}+\"?)|\d+|[.,?!:;]/
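
As for why the pattern in the question behaves differently on the two strings: its last alternative, `. `, matches any character followed by a space. A "?" at the very end of the string has no space after it, so none of the alternatives can match it and it is dropped. When a space does follow, that space becomes part of the token. A small check of both cases (the variable names are only for illustration):

```ruby
# The pattern from the question; the last alternative is
# "any single character followed by a space".
original = /[A-Za-z]+|\d+|. /

# "?" is the last character, so ". " cannot match and it is skipped.
at_end = "session?".scan(original)
p at_end
# => ["session"]

# Here a space follows "?", so both are captured as one token.
before_space = "you? call".scan(original)
p before_space
# => ["you", "? ", "call"]
```

Matching punctuation with a character class such as [.,?!:;], as the answer does, removes the dependence on a trailing space.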