Question

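I'm writing a Ruby script that searches text files for the names of Massachusetts towns. I need to capture a set number of characters on either side of any matching term and save them as a string.

For example, the following passage includes the term "Springfield". I need to capture the term Springfield, along with the 20 characters on either side of it, and save the whole excerpt as a string, excerpt.

This is a sample passage that includes the term Springfield. The sample passage goes on to describe the population, demographics and tourist attractions in the community etc.

The result should be something like this:

excerpt = "t includes the term Springfield. The sample passage"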

Answer 1

Try this:

text = "This is a sample passage that includes the term Springfield. The sample passage goes on to describe the population, demographics and tourist attractions in the community etc."

search = "Springfield"
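i = text.index(search)

# Exclusive range (...) so that exactly 20 characters follow the match.
# Assumes the term is found and at least 20 characters precede it.
excerpt = text[i-20...i+search.size+20]
# => "t includes the term Springfield. The sample passage"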
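The index-based approach above assumes the term is present and has at least 20 characters before it. If the term is missing, text.index returns nil, and a negative start index would count from the end of the string. Here is a minimal sketch that guards against both cases; the helper name excerpt_around and the radius parameter are my own, not part of the original answer.

```ruby
# Hypothetical helper (name and signature are assumptions for illustration).
# Returns up to `radius` characters on each side of the first occurrence
# of `term`, or nil when `term` does not occur in `text`.
def excerpt_around(text, term, radius = 20)
  i = text.index(term)
  return nil if i.nil?

  # Clamp the start at 0 so a match near the beginning does not
  # produce a negative index (which would count from the end).
  start = [i - radius, 0].max

  # An end past the string length is fine: Ruby truncates the slice.
  text[start...(i + term.size + radius)]
end

text = "This is a sample passage that includes the term Springfield. " \
       "The sample passage goes on to describe the population, " \
       "demographics and tourist attractions in the community etc."

excerpt = excerpt_around(text, "Springfield")
# => "t includes the term Springfield. The sample passage"
```

Note that this only looks at the first occurrence; for every occurrence, a scan-based approach is probably the better fit.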

Answer 2

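I think this is close to what you're looking for, but you haven't given all the rules. In particular, you haven't said what should happen if fewer than 20 characters precede or follow "Springfield". (I've assumed up to 20.) You also haven't said whether "Springfield" can be part of a longer word. I've assumed it cannot, but just remove the word breaks (\b) from the regex if that's not the case. Finally, I've joined the matches with ':' only to show where the joins occur; you can of course change that to ''.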

def extract(str)
  str.scan(/.{,20}\bSpringfield\b.{,20}/).join(':')
end

extract(text)
  #=> "t includes the term Springfield. The sample passage" 
extract("a Springfield 123456789012345678 Springfield b")
  #=> "a Springfield 123456789012345678 :Springfield b" 
extract("a bSpringfield 123456789012345678 Springfield b")
  #=> " 123456789012345678 Springfield b"

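If, in the second example, you want (up to) 20 characters shown before the second Springfield as well, you can use the block form of String#scan with a positive lookahead. Here the block variable m is an array containing the values of the two capture groups (i.e., m => [$1, $2]). Note that when a block is given, scan returns the original string, so it's necessary to collect the matches in an array (here a).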

def extract(str)
  a = []
  str.scan(/(.{,20}\bSpringfield)\b(?=(.{,20}))/) { |m| a << m.join }
  a.join(':')
end

extract("a Springfield 123456789012345678 Springfield b")
  #=> "a Springfield 123456789012345678 : 123456789012345678 Springfield b"
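Since the question is about many town names rather than just Springfield, the lookahead version above could be generalized along these lines. This is only a sketch: the method name excerpts, the radius parameter and the sample TOWNS list are made up for illustration. Regexp.union escapes each name, so names containing spaces or punctuation are matched literally.

```ruby
# Hypothetical sample list; a real script would load the full set of names.
TOWNS = ["Springfield", "Worcester", "Great Barrington"].freeze

# Hypothetical helper (name and signature are assumptions).
# Returns an array with one excerpt per whole-word occurrence of any term,
# each with up to `radius` characters of context on either side.
def excerpts(str, terms, radius = 20)
  names = Regexp.union(terms) # escapes each term and joins them with |

  # Group 1: leading context plus the term. Group 2: trailing context,
  # captured inside a lookahead so it is not consumed and can still
  # serve as leading context for the next match.
  pattern = /(.{0,#{radius}}\b(?:#{names})\b)(?=(.{0,#{radius}}))/

  # Without a block, scan returns an array of [group1, group2] pairs.
  str.scan(pattern).map(&:join)
end

excerpts("a Springfield 123456789012345678 Springfield b", TOWNS)
# => ["a Springfield 123456789012345678 ", " 123456789012345678 Springfield b"]
```

One thing to keep in mind: `.` does not match a newline unless the regex has the m flag, so as written the context stops at line breaks.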

Capturing characters before and after a matching term
