Sort an array of strings by "uniqueness"

Asked: 2016-01-04 20:37:01

Tags: arrays ruby sorting set edit-distance

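I came across the Levenshtein edit distance algorithm (via the damerau-levenshtein gem), and I think it fits my purpose well.

This code compares each element with every other element in the array and adds the result of each comparison to a set of hashes, which is then sorted by the :distance key.

The data in the array are logs from a Java service, so a large edit distance indicates which logs are the most unique compared with the others.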
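To illustrate why a large edit distance signals an unusual message, here is a minimal sketch using a plain Levenshtein distance written in pure Ruby. The helper `levenshtein` and the shortened messages are assumptions for illustration; it is not the gem's implementation, which also accounts for transpositions.

```ruby
# Plain Levenshtein distance (insertions, deletions, substitutions only).
# Illustrative helper; the damerau-levenshtein gem additionally handles transpositions.
def levenshtein(a, b)
  # prev[j] holds the distance between the processed prefix of a and b[0, j].
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Two messages that differ only in the status code are close together.
near = levenshtein("Error: 404", "Error: 500")
# An unrelated message is much farther away.
far = levenshtein("Error: 404", "Throughput exceeded")

puts near # 2
puts far
```

Messages that share a long common prefix end up with small distances to each other, while a message with little in common with the rest gets large distances to everything.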

The input data has the following form:
["Failed to process service event Error: 404 Not Found", "Failed to process service event Error: Resource not found in Storage service", "Throughput exceeded for table test-us-east-1-service-table."]

require 'set'
require 'damerau-levenshtein'

def get_edit_distances(arr)
  if arr.empty?
    return []
  end
  if arr.length == 1
    return [arr[0]]
  end
  dl = DamerauLevenshtein
  results = Set.new
  i = 0 #array position
  while i < arr.length
    j = i + 1 #element to compare arr[i] against

    while j < arr.length
      results.add({message: arr[i], distance: dl.distance(arr[i], arr[j], 1, 256)})

      #This is to make sure we have every element in the final results
      if j+1 == arr.length 
        results.add({message: arr[j], distance: dl.distance(arr[0], arr[j], 1, 256)})
        break
      end

      j += 1 #increment 
    end
    i += 1
  end
  final_results = results.to_a
  #sort in descending order by distance
  final_results.sort! {|a,b| b[:distance] <=> a[:distance]}
  #remove duplicates of messages now that everything is sorted
  final_results.uniq! {|m| m[:message]}
  #return array of messages
  final_results.map {|r| r[:message]}
end

The output of this code is an array of messages, sorted by uniqueness:
["Throughput exceeded for table test-us-east-1-service-table.", "Failed to process service event Error: Resource not found in Storage service", "Failed to process service event Error: 404 Not Found"]

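For an array of 928 elements (there would normally be ~100,000), I got an output of 11,801 elements (a single message can have several edit distances; the set only prevents duplicates of the same message with the same distance).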
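The deduplication behavior mentioned in the parentheses can be shown with a small sketch. The messages and distances here are made up for illustration:

```ruby
require 'set'

results = Set.new
results.add({ message: "log A", distance: 3 })
results.add({ message: "log A", distance: 3 }) # equal hash, so the set ignores it
results.add({ message: "log A", distance: 7 }) # same message, different distance: kept

puts results.size # 2
```

Because hashes are compared by content, the set only collapses entries whose message and distance are both identical, which is why the same message can appear many times before the final `uniq!` step.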

Benchmark result for the entire loop:

                    user      system     total       real
Edit Dist Loop:  62.260000   0.110000  62.370000 ( 62.456783)

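Question: Is there a better way to create a sorted array / set of unique elements ordered by uniqueness?

1 Answer:

Answer 0: (score: 0)

I hope I have understood your original question correctly: "sort an array of log messages by uniqueness", that is, find the rarest logs.

If that is the case, try: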

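def sort_by_uniqueness(arr)
  # Count how often each message occurs; missing keys default to 0.
  counts = Hash.new(0)
  arr.each do |entry|
    counts[entry] += 1
  end
  # Rarest messages first: sort by ascending count and keep only the messages.
  counts.sort_by { |_message, count| count }.map(&:first)
end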
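
As a usage sketch of this counting approach, the following self-contained example uses a hypothetical helper name, `rarest_first`, and made-up log lines. Note that it only groups exact duplicates: two messages that differ by a single character are counted as different messages, so this is not a substitute for edit distance if near-duplicates should be treated as the same log.

```ruby
# Hypothetical helper: order distinct messages from least to most frequent.
def rarest_first(messages)
  counts = messages.each_with_object(Hash.new(0)) do |message, tally|
    tally[message] += 1
  end
  counts.sort_by { |_message, count| count }.map(&:first)
end

# Made-up sample data: "timeout" occurs 3 times, "404" twice, "table throttled" once.
logs = ["timeout", "404", "timeout", "timeout", "404", "table throttled"]

p rarest_first(logs) # ["table throttled", "404", "timeout"]
```

This makes a single pass over the input plus a sort over the distinct messages, so it avoids comparing every pair of messages.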