How can I use a Regexp to grab a specified number of words that contain special characters?

Time: 2014-09-21 21:55:23

Tags: ruby regex markov-chains

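I'm currently working on a Markov chain text generator application in Ruby that takes in a body of text (the "corpus") and generates new text based on it. The problem I need to solve right now is writing a Regexp that returns an array of chunks, each containing the number of words I specify. In other words, I want to grab a certain number of words (specified by the user) at a time, repeatedly across the whole string.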

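Going off another application I've seen, I'm using something like /(([.,?"();\-!':—^\w]+ ){#{depth}})/, where #{depth} interpolates how many words I want at a time. With a depth of 2, this should grab two words at a time while allowing a subset of special characters, and that's the part that is tripping me up. So the overall question is: how can I dynamically specify the number of words I want (separated by spaces) while also allowing a set of special characters within those words?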

Here is what I have so far:

# Regex
@match_regex = /(([.,?"();\-!':—^\w]+ ){2})/
s = input.scan(@match_regex).to_a
puts s.inspect

# Input
Within weeks they planned a meeting. She sent him poetry along with her itinerary,
having worked in a business meeting to excuse the opportunity. He prepared flowers
and a banner of welcome on his hearth. 

# Output - seems to be grabbing last word again for some reason
[["Within weeks ", "weeks "], ["they planned ", "planned "], ["a meeting. ", "meeting. "],
["She sent ", "sent "], ["him poetry ", "poetry "], ["along with ", "with "],
["her itinerary, ", "itinerary, "], ["having worked ", "worked "], ["in a ", "a "],
["business meeting ", "meeting "], ["to excuse ", "excuse "],
["the opportunity. ", "opportunity. "], ["He prepared ", "prepared "], ["flowers and ", "and "],
["a banner ", "banner "], ["of welcome ", "welcome "], ["on his ", "his "]]

# Desired output. I'm not picky if it has trailing spaces or not as I can always trim that
["Within weeks", "they planned", "a meeting.", "She sent", "him poetry", "along with",
"her itinerary,", "having worked", "in a", "business meeting", "to excuse", "the opportunity.",
"He prepared", "flowers and", "a banner", "of welcome", "on his"]

Any help is greatly appreciated. Thanks!

2 answers:

Answer 0: (score: 0)

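In a regular expression, every set of parentheses creates a capture group, and when the pattern contains groups, String#scan returns a list of those groups for each match it finds in the input.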

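You have two sets of parentheses: the first around the whole expression and the second around each word (note that for a repeated capture group such as (foo){x}, only the last repetition is returned). That is why each match comes back as a two-item list.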
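Both behaviors described above can be checked in a few lines of Ruby. This is only a minimal sketch; the sample string and variable names are invented for illustration and do not come from the question.

```ruby
# Sample input invented for illustration.
sample = "one two three four "

# Two capturing groups: scan returns one array per match, with one entry per group.
with_groups = sample.scan(/((\w+ ){2})/)

# The outer group holds the whole two-word match; the repeated inner group
# keeps only its last repetition.
p with_groups  # => [["one two ", "two "], ["three four ", "four "]]
```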

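To get what you want, you need to get rid of those capture groups. The first one can simply be removed; the second you want to turn into a non-capturing group, which you do by starting the parentheses with ?:. The expression you need is therefore: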

@match_regex = /(?:[.,?"();\-!':—^\w]+ ){2}/
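To tie this back to the #{depth} interpolation in the question, here is a hedged sketch of the same pattern with the word count passed in. The method name chunk_words and its parameters are hypothetical, and the em dash from the original character class is left out to keep the example ASCII-only.

```ruby
# chunk_words is a hypothetical helper, not part of the question's code.
# depth is the number of space-terminated words per chunk.
def chunk_words(input, depth)
  # (?: ... ) groups without capturing, so scan returns plain strings.
  pattern = /(?:[.,?"();\-!':^\w]+ ){#{depth}}/
  # Each match ends with the space that terminated its last word; strip it.
  input.scan(pattern).map(&:strip)
end

input = "Within weeks they planned a meeting. She sent him poetry"
p chunk_words(input, 2)
# => ["Within weeks", "they planned", "a meeting.", "She sent"]
p chunk_words(input, 3)
# => ["Within weeks they", "planned a meeting.", "She sent him"]
```

Because the pattern requires a space after every word, a final word with no trailing space (and any leftover words that do not fill a whole chunk) is not returned.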

Answer 1: (score: 0)

If I understand your question, I think this should work for you:

def split_it(text, num_words, special_chars)
  text.scan(/(?:[\w#{special_chars}]+(?:\s+|$)){#{num_words}}/)
end

text =<<_
Within weeks they planned a meeting. She sent him poetry along with her itinerary,
having worked in a business meeting to excuse the opportunity. He prepared flowers
and a banner of welcome on his hearth.
_

special_chars = ".,?\"();\\-!':"

split_it(text, 2, special_chars)
  #=> ["Within weeks ", "they planned ", "a meeting. ", "She sent ", "him poetry ",
  #    "along with ", "her itinerary,\n", "having worked ", "in a ",
  #    "business meeting ", "to excuse ", "the opportunity. ", "He prepared ",
  #    "flowers\nand ", "a banner ", "of welcome ", "on his "]
split_it(text, 3, special_chars)
  #=> ["Within weeks they ", "planned a meeting. ", "She sent him ",
  #    "poetry along with ", "her itinerary,\nhaving ", "worked in a ",
  #    "business meeting to ", "excuse the opportunity. ", "He prepared flowers\n",
  #    "and a banner ", "of welcome on "]

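Note the \\- in special_chars. If you had \- instead, a bare - would appear between the brackets in the regex; Ruby would then expect you to be defining a range and would raise an exception. The extra backslash makes \- appear between the brackets, which tells Ruby it is a literal -. @Amadan points out that the escape is not needed if - is at the beginning or end of the string.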
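The range problem described above can be reproduced directly. The sketch below also uses Regexp.escape, a core Ruby method that escapes - along with other metacharacters, as one possible way to avoid hand-writing the backslashes. The variable names and the sample sentence are illustrative only.

```ruby
# The same characters as special_chars, but with the hyphen left unescaped.
unescaped = ".,?\"();-!':"

# Between brackets, ;-! reads as a range from ";" down to "!", which is invalid.
error = begin
  Regexp.new("[\\w#{unescaped}]+")
  nil
rescue RegexpError => e
  e
end
puts error.class  # RegexpError

# Regexp.escape inserts the backslash before "-" (and other metacharacters).
escaped = Regexp.escape(unescaped)
pattern = Regexp.new("(?:[\\w#{escaped}]+(?:\\s+|$)){2}")
p "well-known fact, really true".scan(pattern)
# => ["well-known fact, ", "really true"]
```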

Markov chains? Hmm.