Question

Is there a way to group an array of strings by their first common letters?

For example:

 array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]

So that when I do

array.group_by{ |string| some_logic_with_string }

the result should be:

{ 
   'hello' => ['hello', 'hello you'],
   'people' => ['people'],
   'fin' => ['finally', 'finland']
}

Answer 1

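Note: some of the test cases are ambiguous, and their expectations conflict with other tests; you will need to fix them.

I don't think a plain group_by will work here; further processing is needed.

I came up with the code below, which seems to handle all the given test cases consistently.

I have left notes in the code to explain the logic. The only way to fully understand it is to inspect the value of h and trace the flow for simple test cases.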

def group_by_common_chars(array)
    # We will iteratively group by as many times as there are characters
    # in the longest possible key, which is the max length of all strings
    max_len = array.max_by {|i| i.size}.size

    # First group by first character.
    h = array.group_by{|i| i[0]}

    # Now iterate remaining (max_len - 1) times
    (1...max_len).each do |c|
        # Let's perform a group by next set of starting characters.
        t = h.map do |k,v|
            v.group_by {|i| i[0..c]}
        end.reduce(&:merge)

        # We need to merge the previously generated hash
        # with the hash generated in this iteration.  Here things get tricky.
        # If previously, we had
        #    {"a" => ["a"], "ab" => ["ab", "abc"]},
        # and now, we have
        #    {"a"=>["a"], "ab"=>["ab"], "abc"=>["abc"]},
        # we need to merge the two hashes such that we have
        #    {"a"=>["a"], "ab"=>["ab", "abc"], "abc"=>["abc"]}.
        # Note that `Hash#merge`'s block is called only for common keys, so
        # the new key "abc" is inserted unchanged; we can't do much about it
        # here.  We will clean it up after the loop.
        h = h.merge(t) do |k, o, n| 
            if (o.size != n.size)
                diff = [o,n].max - [o,n].min
                if diff.size == 1 && t.value?(diff)
                    [o,n].max
                else
                    [o,n].min
                end
            else
                o
            end
        end
    end

    # Sort by key length, smallest in the beginning.
    h = h.sort {|i,j| i.first.size <=> j.first.size }.to_h

    # Get rid of those key-value pairs, where value is single element array
    # and that single element is already part of another key-value pair, and
    # that value array has more than one element.  This step will allow us
    # to get rid of key-value like "abc"=>["abc"] in the example discussed
    # above.

    h = h.tap do |h|
        keys = h.keys
        keys.each do |k|
            v = h[k]    
            if (v.size == 1 && 
                h.key?(v.first) && 
                h.values.flatten.count(v.first) > 1) then
                h.delete(k)
            end
        end
    end

    # Get rid of those keys whose value array consists only of elements that
    # are already part of some other key.  Since the hash is ordered by key
    # length, this process allows us to get rid of those keys which are shorter
    # but consist only of elements that are present somewhere else
    # under a longer key.  For example, it lets us get rid of
    # "a"=>["aba", "abb", "aaa", "aab"] from a hash like
    # {"a"=>["aba", "abb", "aaa", "aab"], "ab"=>["aba", "abb"], "aa"=>["aaa", "aab"]}
    h.tap do |h|
        keys = h.keys
        keys.each do |k|
            values = h[k]
            other_values = h.values_at(*(h.keys-[k])).flatten
            already_present = values.all? do |v|
                other_values.include?(v)
            end
            h.delete(k) if already_present
        end
    end
end

Sample runs:

p group_by_common_chars ['hello', 'hello you', 'people', 'finally', 'finland']
#=> {"fin"=>["finally", "finland"], "hello"=>["hello", "hello you"], "people"=>["people"]}

p group_by_common_chars ['a', 'ab', 'abc']
#=> {"a"=>["a"], "ab"=>["ab", "abc"]}

p group_by_common_chars  ['aba', 'abb', 'aaa', 'aab']
#=> {"ab"=>["aba", "abb"], "aa"=>["aaa", "aab"]}

p group_by_common_chars ["Why", "haven't", "you", "answered", "the", "above", "questions?", "Please", "do", "so."]
#=> {"a"=>["answered", "above"], "do"=>["do"], "Why"=>["Why"], "you"=>["you"], "so."=>["so."], "the"=>["the"], "Please"=>["Please"], "haven't"=>["haven't"], "questions?"=>["questions?"]}
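To follow the advice about inspecting h, here is a minimal sketch that reproduces only the first two passes of the function by hand on the input ['a', 'ab', 'abc']. It is a trace for illustration, not a replacement for the function above; the variable names mirror the ones used there.

```ruby
array = ['a', 'ab', 'abc']

# Pass 0: group by the first character only.
h = array.group_by { |s| s[0] }
p h
#=> {"a"=>["a", "ab", "abc"]}

# Pass 1 (c = 1): regroup every bucket by its first two characters
# and merge the per-bucket hashes into one.
t = h.map { |_key, bucket| bucket.group_by { |s| s[0..1] } }.reduce(&:merge)
p t
#=> {"a"=>["a"], "ab"=>["ab", "abc"]}
```

The string "a" is shorter than two characters, so `s[0..1]` returns it whole and it keeps its own bucket. The merge block in the function then decides, key by key, whether the old or the new bucket survives.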

Answer 2

I'm not sure whether you can group by all the common letters. But if you only want to group by the first letter, here it is:

array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]    
result = {}
array.each { |st| result[st[0]] = result.fetch(st[0], []) + [st] }

pp result
#=> {"h"=>["hello", "hello you"], "p"=>["people"], "f"=>["finally", "finland"]}

Now result contains the hash you want.
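For comparison, the same first-letter grouping can be written more directly with Enumerable#group_by. This is a minimal sketch; by_first and by_three are names chosen here for illustration. Grouping on a fixed-length prefix such as `st[0, 3]` also gets close to the question's sample, but it is only an assumption that three characters is the right length, and it yields "hel" and "peo" rather than "hello" and "people".

```ruby
array = ['hello', 'hello you', 'people', 'finally', 'finland']

# Same result as the each/fetch loop: bucket by the first character.
by_first = array.group_by { |st| st[0] }
p by_first
#=> {"h"=>["hello", "hello you"], "p"=>["people"], "f"=>["finally", "finland"]}

# Bucket by a fixed three-character prefix (the length 3 is an assumption).
by_three = array.group_by { |st| st[0, 3] }
p by_three
#=> {"hel"=>["hello", "hello you"], "peo"=>["people"], "fin"=>["finally", "finland"]}
```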

Answer 3

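Well, you are trying to do something quite custom. I can think of two classic approaches that could do what you want: 1) stemming and 2) Levenshtein distance.

With stemming you find the root of a longer word. There is a gem for this.

Levenshtein is a well-known algorithm that computes the difference between two strings. There is a gem for it that runs very fast thanks to a native C extension.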
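To make the Levenshtein idea concrete, here is a minimal pure-Ruby sketch of the distance computation. It is hand-written for illustration and is not the API of the gem mentioned above; the method name levenshtein is hypothetical. It counts the fewest single-character insertions, deletions, and substitutions needed to turn one string into the other.

```ruby
# Hypothetical helper: classic dynamic-programming edit distance,
# keeping only the previous row of the table in memory.
def levenshtein(a, b)
  # Distance from the empty prefix of a to each prefix of b.
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # distance from a[0, i] to the empty string
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,       # delete ca
        curr[j - 1] + 1,   # insert cb
        prev[j - 1] + cost # substitute (or keep when equal)
      ].min
    end
    prev = curr
  end
  prev.last
end

p levenshtein('kitten', 'sitting')   #=> 3
p levenshtein('hello', 'hello you')  #=> 4
```

A small distance suggests two strings are similar, but turning distances into groups still requires choosing a threshold, which the question does not specify.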
