Question

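I need to extract the sentences that contain the word island or Island from a paragraph. Each sentence starts with a capital letter and ends with a period.

The paragraph is given as a string: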

" The islands were settled from the second century AD by a series of local empires. In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826. During World War II, Singapore was occupied by Japan. It gained independence from Britain in 1963, by uniting with other former British territories to form Malaysia, but was expelled two years later over ideological differences. After early years of turbulence, and despite lacking natural resources and a hinterland, the nation developed rapidly as an Asian Tiger economy, based on external trade and its human capital. " (source: https://en.wikipedia.org/wiki/Singapore)

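Desired result, as array elements:

The islands were settled from the second century AD by a series of local empires.
In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826.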

I found examples of how to do this in other languages, such as Java (Regex to find sentence containing specific word (java) from paragraph). However, the same regex does not work in Ruby.

Is this possible in Ruby?

Answer 1

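I would probably do this without regular expressions. They are hard to read and understand when you come back to the code later. Simply splitting the paragraph into sentences and then selecting by keyword should do: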

input.split('.').select do |sentence|
  sentence.downcase.include?('island')
end

Of course, the paragraph may contain other '.' characters that do not separate sentences.
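One detail worth knowing: `split('.')` removes the periods and leaves the leading space on each piece. Below is a minimal sketch of one way to tidy that up. The helper name `sentences_with` and the sample input are made up for illustration, and it assumes every sentence ends in a period. Like the original, it matches substrings, so "islander" would also count.

```ruby
# Hypothetical helper (not from the answer): split on periods, tidy each
# piece, and keep only the pieces that mention the keyword.
def sentences_with(paragraph, keyword)
  paragraph.split('.')
           .map(&:strip)          # drop the whitespace left around each piece
           .reject(&:empty?)      # ignore empty pieces, e.g. after the last period
           .select { |s| s.downcase.include?(keyword.downcase) }
           .map { |s| "#{s}." }   # put back the period that split removed
end

# Made-up sample input.
input = " The islands were settled early. Singapore was occupied by Japan. The island grew. "
p sentences_with(input, 'island')
```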

Answer 2

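I suggest using two regular expressions: one to break the string into sentences, and another to select the sentences containing the word "island" or "islands", whose first letter may be capitalized.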

str.split(/(?<=\.)\s+/).select { |s| s =~ /\b[iI]slands?\b/ }
  #=> ["The islands were settled from the second century AD by a series of local empires.",
  #    "In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of
  #     the East India Company; after the company collapsed, the islands were ceded to
  #     Britain and became part of its Straits Settlements in 1826."] *

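/(?<=\.)\s+/ matches one or more whitespace characters that follow a period; the period is checked with a positive lookbehind, so it is not consumed.
/\b[iI]slands?\b/ matches the strings "island", "Island", "islands" and "Islands", surrounded by word breaks (to avoid matching, for example, "islander").

* I added two line breaks here to make it more readable.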
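To see what the word breaks buy you, here is a small sketch comparing the pattern with and without `\b`. The sample string is made up for illustration and is not from the question.

```ruby
# Made-up sample text (not from the question) to show what \b excludes.
str = "An islander waved. The Islands are small. We left the island. Nothing here."

sentences = str.split(/(?<=\.)\s+/)

# Without \b, the "island" inside "islander" also counts as a hit.
loose = sentences.select { |s| s =~ /[iI]slands?/ }

# With \b on both sides, only the whole words island(s) / Island(s) match.
strict = sentences.select { |s| s =~ /\b[iI]slands?\b/ }

p loose
p strict
```

Here `loose` keeps the "islander" sentence while `strict` drops it.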

Answer 3

Yes. Going by your description, the most straightforward way is probably:

string.scan(/(?=[A-Z])[^.]*island[^.]*\./i)
# => [
#   "The islands were settled from the second century AD by a series of local empires.",
#   "In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826."
# ]
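One caveat, as far as I can tell: the `/i` flag applies to the whole pattern, so `(?=[A-Z])` also accepts lowercase letters and the capital-letter check is effectively switched off. If that check matters, a possible variant (my sketch, not part of the answer) scopes case-insensitivity to the keyword with `(?i:...)`. The sample string is made up. Neither pattern verifies that the capital letter actually sits at the start of a sentence.

```ruby
# Made-up sample: the second "sentence" starts with a lowercase letter.
sample = "The island is small. and the island is far."

# /i covers the whole pattern, so (?=[A-Z]) also accepts lowercase letters.
with_flag = sample.scan(/(?=[A-Z])[^.]*island[^.]*\./i)

# Scoping the flag to the keyword keeps the capital-letter check case-sensitive.
scoped = sample.scan(/(?=[A-Z])[^.]*(?i:island)[^.]*\./)

p with_flag
p scoped
```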

Answer 4

This solution produces the correct result for the sample text.

text = " The islands were settled from the second century AD by a series of local empires. In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826. During World War II, Singapore was occupied by Japan. It gained independence from Britain in 1963, by uniting with other former British territories to form Malaysia, but was expelled two years later over ideological differences. After early years of turbulence, and despite lacking natural resources and a hinterland, the nation developed rapidly as an Asian Tiger economy, based on external trade and its human capital."

matches = text.scan(/\b[A-Z][^.]+[Ii]sland[^.]+?\./)

matches.each do |match|
  puts "Found: #{match}"
end

This produces the following output:

Found: The islands were settled from the second century AD by a series of local empires.
Found: In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826.

As in the linked question, support for other sentence terminators (e.g. "!" and "?") can be added with only a small change:

matches = text.scan(/\b[A-Z][^.!?]+[Ii]sland[^.!?]+?[.!?]/)
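A note on edge cases: the `+` quantifiers require at least one character on each side of the keyword, and `[A-Z]` consumes the first letter, so a sentence that begins with "Island" or ends right after "island" is skipped. The sketch below compares the pattern with a relaxed variant of my own that uses a lookahead and `*` instead. The sample string is made up for illustration.

```ruby
# Made-up edge cases: keyword at the very start and at the very end of a sentence.
edge = "Islands are fun. We saw the island! Was it big?"

original = /\b[A-Z][^.!?]+[Ii]sland[^.!?]+?[.!?]/
# Variant (an assumption, not from the answer): check the capital with a
# lookahead so it is not consumed, and allow zero characters around the keyword.
relaxed  = /\b(?=[A-Z])[^.!?]*[Ii]sland[^.!?]*[.!?]/

strict_hits  = edge.scan(original)
relaxed_hits = edge.scan(relaxed)

p strict_hits
p relaxed_hits
```

With this input the original pattern finds nothing, while the relaxed one returns the first two sentences.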

Answer 5

You can use this regex:

(?<=^|[.?!])(.*?[Ii]sland.*?(?:[.?!]|$))

Rubular Demo

Ruby code

print str.scan(/(?<=^|[.?!])(.*?[Ii]sland.*?(?:[.?!]|$))/)

Ideone Demo
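Two things to watch for with this pattern, as far as I can tell. Because it contains a capture group, `scan` returns an array of one-element arrays. And since `.` also matches sentence terminators, a match can start in an earlier sentence that does not contain the keyword. The sketch below shows both on a made-up sample, followed by a possible variant (my assumption, not the answer's code) that excludes terminators from the wildcards.

```ruby
# Made-up sample where the first sentence does not contain the keyword.
str = "Nothing here. The island is small."

# With a capture group, scan returns one array per match; .*? also runs
# across the first period to reach the keyword.
grouped = str.scan(/(?<=^|[.?!])(.*?[Ii]sland.*?(?:[.?!]|$))/)

# Variant: exclude terminators from the wildcards, drop the capture group,
# and trim the leading whitespace of each match.
flat = str.scan(/(?<=^|[.?!])[^.?!]*[Ii]sland[^.?!]*(?:[.?!]|$)/).map(&:strip)

p grouped
p flat
```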
