How to extract capitalized words from a string, along with 1 to 3 words on either side of each

Time: 2016-10-16 02:47:36

Tags: ruby regex string

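I want to:

  • Find capitalized words.
  • Extract each such word into an element of an array.
  • Also extract the 1-3 words before and after the initial word, as part of the same element.

Also, a couple of notes:

  • Duplicate elements - I realize this may produce some duplicates. That's fine; I can deal with them later. Ideally there would be no duplicates, so if there is a way to dedupe at this step that would be great, but it is not the main focus.
  • In some cases a capitalized word is followed by punctuation or a symbol, so the words after the symbol should not be included in that array element.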

If I have a string words:

Welcome\r\n        About\r\n    Hello, I'm John Van der Lyn and welcome to our website. We try to tailor our services to your specific needs, provide personal attention and someone to call with answers to your tax and financial questions and issues throughout the year. We believe in establishing long-term relationships with our clients and in providing good ole fashion service.\r\n            \r\n\r\n     We provide all levels of services for individuals with their tax and financial needs as well as Personal Representatives of Estates, or Trustees or beneficiaries of 

An acceptable array result would look like this:

["Welcome About Hello", "Welcome About Hello I'm", "About Hello I'm John", "Hello I'm John Van", "I'm John Van der Lyn", etc.]

A better, more ideal result would look like this:

["Welcome About Hello", "I'm John Van der Lyn", "We try to", etc.]

A perfect and exceptional (though far more complicated) result would be:

["Welcome", "About", "Hello", "I'm John Van der Lyn", etc.]

I tried splitting the string, but I couldn't figure out how to pass split a regex that follows these rules. I also couldn't figure out how to make each element a group of four words rather than a single word.

2 Answers:

Answer 0: (score: 1)

# Collect every run of word characters, apostrophes and hyphens.
words = str.scan(/[\w'\-]+/)

>> ["Welcome", "About", "Hello", "I'm", "John", "Van", "der", "Lyn", "and", "welcome", "to", "our", "website", "We", "try", "to", "tailor", "our", "services", "to", "your", "specific", "needs", "provide", "personal", "attention", "and", "someone", "to", "call", "with", "answers", "to", "your", "tax", "and", "financial", "questions", "and", "issues", "throughout", "the", "year", "We", "believe", "in", "establishing", "long-term", "relationships", "with", "our", "clients", "and", "in", "providing", "good", "ole", "fashion", "service", "We", "provide", "all", "levels", "of", "services", "for", "individuals", "with", "their", "tax", "and", "financial", "needs", "as", "well", "as", "Personal", "Representatives", "of", "Estates", "or", "Trustees", "or", "beneficiaries", "of"]

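word_groups = []  # must exist before the loop appends to it
words.each_with_index do |word, i|
  # Only words that start with an ASCII capital letter open a group.
  if word[0].match(/[A-Z]/)
    tmp = []
    tmp << words[i-2] unless i-2 < 0   # skip neighbours before the start
    tmp << words[i-1] unless i-1 < 0
    tmp << word
    tmp << words[i+1] if words[i+1]    # skip neighbours past the end
    tmp << words[i+2] if words[i+2]
    word_groups << tmp
  end
end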

>> [["Welcome", "About", "Hello"], ["Welcome", "About", "Hello", "I'm"], ["Welcome", "About", "Hello", "I'm", "John"], ["About", "Hello", "I'm", "John", "Van"], ["Hello", "I'm", "John", "Van", "der"], ["I'm", "John", "Van", "der", "Lyn"], ["Van", "der", "Lyn", "and", "welcome"], ["our", "website", "We", "try", "to"], ["the", "year", "We", "believe", "in"], ["fashion", "service", "We", "provide", "all"], ["well", "as", "Personal", "Representatives", "of"], ["as", "Personal", "Representatives", "of", "Estates"], ["Representatives", "of", "Estates", "or", "Trustees"], ["Estates", "or", "Trustees", "or", "beneficiaries"]]

word_groups.map { |grp| grp.join(' ') }

>> ["Welcome About Hello", "Welcome About Hello I'm", "Welcome About Hello I'm John", "About Hello I'm John Van", "Hello I'm John Van der", "I'm John Van der Lyn", "Van der Lyn and welcome", "our website We try to", "the year We believe in", "fashion service We provide all", "well as Personal Representatives of", "as Personal Representatives of Estates", "Representatives of Estates or Trustees", "Estates or Trustees or beneficiaries"]
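The question also asked about avoiding duplicates and about varying the number of neighbouring words. The approach above can be wrapped up along these lines; this is only a sketch, and `capitalized_windows` and its `radius` parameter are hypothetical names that do not appear in the answer. It relies on the fact that an Array range running past the end is clamped, so no explicit upper-bound check is needed.

```ruby
# Hypothetical helper (not from the answer above): returns, for every word
# starting with an ASCII capital, that word joined with up to `radius`
# neighbours on each side. Identical windows are dropped with uniq.
def capitalized_windows(str, radius = 2)
  words = str.scan(/[\w'-]+/)
  windows = words.each_index.select { |i| words[i].match?(/\A[A-Z]/) }.map do |i|
    first = [i - radius, 0].max          # clamp at the start of the array
    words[first..(i + radius)].join(' ') # ranges past the end are clamped by Ruby
  end
  windows.uniq
end

sample = "Welcome\r\n About\r\n Hello, I'm John Van der Lyn and welcome to our website."
puts capitalized_windows(sample)
```

Note that `uniq` only removes exact repeats; overlapping windows such as "Welcome About Hello" and "Welcome About Hello I'm" are both kept.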

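Answer 1: (score: 0)

  1. This probably has no solution.

  2. If you have a strict pattern for how names should be matched, it is more or less solvable.

  3. Let's pretend we have a name matcher. In our case it will be: a name consists of at most 4 words, at least 2 of which are capitalized (the first and the last), and a name cannot contain odd symbols such as ".".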

    matcher = ->(words) do
      words.first =~ /\A\p{Lu}/ && # first word is capitalized
      words.last =~ /\A\p{Lu}/ &&  # last word is capitalized
      words.all?(&/\A\p{L}+\z/.method(:=~)) # every word is letters only
    end
    

    Here we use proper Unicode character matchers. Now we can filter our input:

    (2..4).map { |i| input.split(/\s+/).each_cons(i).select(&matcher) }
          .reduce(&:|)
    

    The above will return

    #⇒ [["Welcome", "About"], ["John", "Van"], 
    #   ["Personal", "Representatives"], ["Van", "der", "Lyn"], 
    #   ["John", "Van", "der", "Lyn"]]
    

    Now we could remove the “weak” duplicates, but I leave that as homework.
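One possible take on that homework, as a minimal sketch: it assumes a "weak" duplicate means a group that appears as a contiguous run inside a longer group, which the answer does not actually define. `drop_contained` is a hypothetical helper name.

```ruby
# Hypothetical helper (not from the answer): drops every group that occurs
# as a contiguous run of words inside a strictly longer group.
def drop_contained(groups)
  groups.reject do |small|
    groups.any? do |big|
      big.size > small.size && big.each_cons(small.size).include?(small)
    end
  end
end

groups = [["Welcome", "About"], ["John", "Van"],
          ["Personal", "Representatives"], ["Van", "der", "Lyn"],
          ["John", "Van", "der", "Lyn"]]
p drop_contained(groups)
```

With the groups listed in the answer, this keeps `["Welcome", "About"]`, `["Personal", "Representatives"]` and `["John", "Van", "der", "Lyn"]`. The comparison is quadratic in the number of groups, which should be fine for short texts like the one in the question.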