I found the Levenshtein edit distance algorithm (via the damerau-levenshtein gem), and I think it fits my purpose well.
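This code compares each element with every other element in the array and adds the result of each comparison to a set of hashes, which is then sorted by the :distance key.
When this code is used, the data in the array are logs from a Java service, so a large edit distance shows which logs are the most unique compared to the others.
The input data has the following form: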
["Failed to process service event Error: 404 Not Found", "Failed to process service event Error: Resource not found in Storage service", "Throughput exceeded for table test-us-east-1-service-table."]
require 'set'
require 'damerau-levenshtein'

def get_edit_distances(arr)
  if arr.empty?
    return []
  end
  if arr.length == 1
    return [arr[0]]
  end
  dl = DamerauLevenshtein
  results = Set.new
  i = 0 # array position
  while i < arr.length
    j = i + 1 # element to compare arr[i] against
    while j < arr.length
      results.add({message: arr[i], distance: dl.distance(arr[i], arr[j], 1, 256)})
      # This is to make sure we have every element in the final results
      if j+1 == arr.length
        results.add({message: arr[j], distance: dl.distance(arr[0], arr[j], 1, 256)})
        break
      end
      j += 1 # increment
    end
    i += 1
  end
  final_results = results.to_a
  # sort in descending order by distance
  final_results.sort! {|a,b| b[:distance] <=> a[:distance]}
  # remove duplicates of messages now that everything is sorted
  final_results.uniq! {|m| m[:message]}
  # return array of messages
  final_results.map {|r| r[:message]}
end
The output of this code is an array of messages, sorted by uniqueness:
["Throughput exceeded for table test-us-east-1-service-table.", "Failed to process service event Error: Resource not found in Storage service", "Failed to process service event Error: 404 Not Found"]
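For an array of 928 elements (there will typically be ~100,000), I get an output of 11801 elements (a single message can have multiple edit distances; the Set only prevents duplicate messages with the same distance).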
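To illustrate the parenthetical above: a Set treats two hashes as duplicates only when both :message and :distance match, so the same message can appear once per distinct distance. A minimal sketch, with a made-up message and made-up distances rather than real measurements:

```ruby
require 'set'

results = Set.new
# The message text and distances below are invented for illustration.
results.add({ message: "log A", distance: 3 })
results.add({ message: "log A", distance: 3 }) # identical hash, ignored by the Set
results.add({ message: "log A", distance: 7 }) # same message, new distance, kept

puts results.size # prints 2
```

This is why the intermediate set can grow much larger than the input array before uniq! collapses it to one entry per message.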
Benchmark results for the whole loop:
                     user     system      total        real
Edit Dist Loop: 62.260000   0.110000  62.370000 ( 62.456783)
Question: is there a better way to create a sorted array / set of unique elements ordered by uniqueness?
Answer 0 (score: 0)
I hope I understood your original question correctly: "sort an array of log messages by uniqueness", i.e. find the rarest logs.
If that is the case, try:
def sort_by_uniqueness(arr)
  # Count how many times each message occurs; missing keys default to 0
  counts = Hash.new(0)
  arr.each do |entry|
    counts[entry] += 1
  end
  # Rarest messages first
  counts.sort_by { |_entry, count| count }.map(&:first)
end
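As a usage sketch of the same counting idea: on Ruby 2.7 or later, Enumerable#tally builds the frequency hash in one call. The helper name rarest_first and the sample array are assumptions made for illustration, not part of the original question:

```ruby
# Hypothetical helper; assumes Ruby 2.7+ for Enumerable#tally.
def rarest_first(logs)
  # tally returns { message => occurrence_count }
  logs.tally.sort_by { |_message, count| count }.map(&:first)
end

# Sample input reusing messages from the question, with one repeated.
logs = [
  "Failed to process service event Error: 404 Not Found",
  "Throughput exceeded for table test-us-east-1-service-table.",
  "Failed to process service event Error: 404 Not Found"
]

sorted = rarest_first(logs)
puts sorted.first # prints the "Throughput exceeded ..." message, which occurs once
```

Note that this counts exact duplicates only. Messages that differ by a single character are treated as distinct, so it is not a drop-in replacement for an edit-distance comparison.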