Question

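I have the following regular expression, which captures the first N words and then ends at the next period, exclamation mark, or question mark. I need to grab chunks of text with varying word counts, but I need complete sentences.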

regex = (?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]

It works for the following text:

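Therapy extract straw and chitosan from shrimp shells alone accounted for 2, 4, 6, 8 and 10%, found that the extract straw 8% is highly effective in inhibiting the growth of algae Microcystis spp. The number of cells and the amount of chlorophyll a was reduced during treatment. Both values decreased continuously until the end of the trial.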

https://regex101.com/r/ardIQ7/5

But it does not work for the following text:

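Therapy extract straw and chitosan from shrimp shells alone 2%, 4%, 6%, 8% and 10% found that the extract straw 8.2% is effective in inhibiting the growth of algae Microcystis spp. The number of cells and the amount of chlorophyll a was reduced during treatment. Both values decreased continuously until the end of the trial.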

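That is because of the number with a decimal point and a percent sign (8.2%).

I have been trying to figure out how to capture these items as well, but I need some help pointing me in the right direction. I don't just want to grab the first sentence. I want to capture N words, which may span several sentences, and return complete sentences.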
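To make the failure concrete, here is a minimal Ruby sketch that tests single tokens against the word part of the regex in the question (`\w+[.?!]?`). The token list is illustrative and not taken from the question's text. A token that the word pattern cannot consume in full interrupts the run of consecutive words the regex is counting.

```ruby
# Word pattern from the question, anchored so it must cover the whole token.
question_word = /\A\w+[.?!]?\z/

# Illustrative tokens (an assumption, not the question's data).
tokens = ["straw", "growth.", "8", "8.2%", "10%"]

# Map each token to whether the word pattern accepts it in full.
results = tokens.to_h { |t| [t, t.match?(question_word)] }
results.each { |token, ok| puts "#{token.inspect}: #{ok}" }
```

Here "8.2%" and "10%" are rejected because `\w` matches neither "%" nor a "." that sits in the middle of a token.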

Answer 1

Try this: (?:\S+[,.?!]?\s+){1,200}[\s\S]*?(\. |!|\?)

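This matches up to N words, where a word is any run of non-whitespace characters.

If the Nth word does not end a sentence, the match continues to the next sentence-ending punctuation mark; if none follows, it backtracks to an earlier one. Specify N as the upper bound of the quantifier, {1,N}.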
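As a rough illustration of how the upper bound behaves, the sketch below wraps the pattern above in a hypothetical helper, `up_to_n_words`, and applies it to an illustrative sample string (neither the helper name nor the sample comes from the answer).

```ruby
# Hypothetical helper: the pattern from this answer with a configurable upper bound n.
def up_to_n_words(n)
  /(?:\S+[,.?!]?\s+){1,#{n}}[\s\S]*?(\. |!|\?)/
end

# Illustrative sample text (an assumption, not the text from the question).
sample = "Extract of 8.2% was effective. Cell counts fell during treatment. Both values kept falling."

# String#[] with a Regexp returns the matched substring, or nil when nothing matches.
short_chunk = sample[up_to_n_words(3)]
long_chunk  = sample[up_to_n_words(6)]

puts short_chunk.strip
puts long_chunk.strip
```

Because the pattern ends a sentence on a period followed by a space, the match carries a trailing space (hence the `strip`), and a final sentence that ends the string with a period is never selected as the end point.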


Answer 2

r = /
    (?:           # begin a non-capture group
      (?:           # begin a non-capture group
        \p{Alpha}+  # match one or more letters
      |           # or
        \-?       # optionally match a minus sign
        (?:       # begin non-capture group
          \d+     # match one or more digits
        |         # or
          \d+     # match one or more digits
          \.      # match a decimal point
          \d+     # match one or more digits
        )         # end non-capture group
        %?        # optionally match a percentage character
      )           # end non-capture group
      [,;:.!?]?   # optionally ('?' following ']') match a punctuation char
      [ ]+        # match one or more spaces      
    )             # end non-capture group
    {9,}?         # repeat the preceding non-capture group at least 9 times, lazily ('?')
    (?:           # begin a non-capture group
      \p{Alpha}+  # match one or more letters
      |           # or
      \-?         # optionally match a minus sign
        (?:       # begin non-capture group
          \d+     # match one or more digits
        |         # or
          \d+     # match one or more digits
          \.      # match a decimal point
          \d+     # match one or more digits
        )         # end non-capture group
      %?          # optionally match a percentage character
    )             # end non-capture group  
    [.!?]         # match one of the three punctuation characters
    (?!\S)        # negative look-ahead: do not match a non-whitespace char
    /x            # free-spacing regex definition mode

Let text equal the paragraph you want to examine ("Therapy extract straw ... end of the trial.").

Then

text[r]
  #=> "Therapy extract straw and chitosan from...the growth of algae Microcystis spp."

We can simplify the construction of the regular expression (and avoid the duplicated bits) as follows.

def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

r = construct_regex(10)
  #=> /(?:(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[,;:.!?]? +){10,}?(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[.!?](?!\S)/
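The following sketch shows how the minimum word count changes the chunk that comes back. It repeats `construct_regex` so that it runs on its own and uses an illustrative sample string, which is an assumption and not the text from the question. Note that the group is repeated at least `min_nbr_words` times and is then followed by one more word, so a match holds at least `min_nbr_words + 1` words.

```ruby
# Same construction as construct_regex above, repeated so the sketch is self-contained.
def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

# Illustrative sample text (an assumption, not the text from the question).
sample = "Extract of 8.2% was effective. Cell counts fell during treatment. Both values kept falling."

# For each minimum, String#[] returns the matched chunk, or nil if the text is too short.
chunks = [3, 5, 10, 14].to_h { |n| [n, sample[construct_regex(n)]] }
chunks.each { |n, chunk| puts "#{n}: #{chunk.inspect}" }
```

With this sample, the minimums 3, 5 and 10 return one, two and three complete sentences respectively, and 14 returns nil because the text does not contain enough words.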

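This regular expression could be simplified if it were acceptable for it to match nonsense words such as "ab2.3e%" or "2.3.2%". As currently defined, the regex does not match such words.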
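One possible loosening, shown purely as an assumption about what such a simplification might look like, treats a word as any run of letters, digits, ".", "%" or "-". The names `loose_word` and `construct_loose_regex` are hypothetical, and the sample string is illustrative.

```ruby
# Strict word pattern from this answer.
strict_word = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
# Hypothetical looser word pattern: any run of letters, digits, '.', '%' or '-'.
loose_word = /[\p{Alpha}\d.%\-]+/

# Hypothetical counterpart of construct_regex built on a caller-supplied word pattern.
def construct_loose_regex(min_nbr_words, word)
  /(?:#{word}[,;:!?]? +){#{min_nbr_words},}?#{word}[.!?](?!\S)/
end

# Compare how the two word patterns treat a valid number and two nonsense tokens.
checks = ["8.2%", "ab2.3e%", "2.3.2%"].to_h { |w|
  [w, { strict: w.match?(/\A#{strict_word}\z/), loose: w.match?(/\A#{loose_word}\z/) }]
}
checks.each { |w, r| puts "#{w}: strict=#{r[:strict]} loose=#{r[:loose]}" }

# Illustrative sample containing a nonsense token (an assumption).
sample = "Dose 2.3.2% was odd. Next sentence here."
loose_chunk = sample[construct_loose_regex(2, loose_word)]
puts loose_chunk
```

The trade-off is the one described above: the looser pattern is shorter, but it accepts tokens that the strict pattern rejects.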

Improve my regex to include numbers that contain decimals and percent signs