Question

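I looked through the following NLP gems listed on GitHub NLP, but could not find the right solution.

Is there a gem or library that can group text according to a given similarity percentage? All of the gems above help find the similarity between two strings, but they take a long time when grouping a large amount of data.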

Answer 1

You can do this with plain Ruby plus one of the gems you listed.

I picked fuzzy-string-match because I like the name.

Here is how to use the gem:

require 'fuzzystringmatch'

# Create the matcher
jarow = FuzzyStringMatch::JaroWinkler.create( :native )

# Get the distance
jarow.getDistance(  "jones",      "johnson" )
# => 0.8323809523809523

# Round it
jarow.getDistance(  "jones",      "johnson" ).round(2)
# => 0.83

Since you get a float back, you can use the round method to set the precision you need.
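Since the question talks about a similarity percentage, here is a small sketch of turning the float into a percentage and checking it against a cutoff. The helper name similar_enough? and the cutoff values are my own, made up for illustration; they are not part of the gem.

```ruby
# Score copied from the jones/johnson example above.
score = 0.8323809523809523

rounded = score.round(2)        # two decimal places
percent = (score * 100).round   # whole-number percentage

# Hypothetical helper (not part of fuzzy-string-match): true when the
# score reaches the requested percentage.
def similar_enough?(score, threshold_percent)
  score * 100 >= threshold_percent
end

puts rounded                        # prints 0.83
puts percent                        # prints 83
puts similar_enough?(score, 80)     # prints true
puts similar_enough?(score, 90)     # prints false
```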

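Now, to group similar results, you can use the group_by method from the Enumerable module.

You pass it a block, and group_by iterates over the collection. On each iteration, the block returns the value you are grouping by (here, the distance), and group_by returns a hash whose keys are the distances and whose values are arrays of the strings that produced each distance.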

require 'fuzzystringmatch'

jarow = FuzzyStringMatch::JaroWinkler.create( :native )

target = "jones"
precision = 2
candidates = [ "Jessica Jones", "Jones", "Johnson", "thompson", "john", "thompsen" ]

distances = candidates.group_by { |candidate|
  jarow.getDistance( target, candidate ).round(precision)
}

distances
# => {0.52=>["Jessica Jones"],
#     0.87=>["Jones"],
#     0.68=>["Johnson"],
#     0.55=>["thompson", "thompsen"],
#     0.83=>["john"]}
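
Note that the hash above groups candidates by their score against a single target. If what you need is to cluster the strings with each other at a given threshold, one possible approach is a greedy pass that compares each string only with the first member of every existing group. The sketch below is just that, a sketch: the similarity method is a plain-Ruby stand-in (a Levenshtein-based ratio) so the example runs without any gem, and group_by_similarity and the 0.75 threshold are names and values I made up. With the gem installed you could call jarow.getDistance inside the find block instead.

```ruby
# Stand-in similarity in plain Ruby (assumption: Levenshtein-based ratio).
# 1.0 means identical, 0.0 means every character differs.
def similarity(a, b)
  return 1.0 if a == b
  longest = [a.length, b.length].max
  # Edit distance by dynamic programming, keeping a single row.
  row = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev_diag = row[0]
    row[0] = i
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      current = [row[j] + 1, row[j - 1] + 1, prev_diag + cost].min
      prev_diag = row[j]
      row[j] = current
    end
  end
  1.0 - row[b.length].to_f / longest
end

# Hypothetical helper: greedy grouping. Each string joins the first group
# whose first member is at least `threshold` similar (case-insensitive);
# otherwise it starts a new group.
def group_by_similarity(strings, threshold)
  groups = []
  strings.each do |s|
    group = groups.find { |g| similarity(g.first.downcase, s.downcase) >= threshold }
    if group
      group << s
    else
      groups << [s]
    end
  end
  groups
end

names = ["Jones", "jones", "Johnson", "thompson", "thompsen", "Jonas"]
groups = group_by_similarity(names, 0.75)
p groups
# => [["Jones", "jones", "Jonas"], ["Johnson"], ["thompson", "thompsen"]]
```

Because each string is compared only with one representative per group, the number of comparisons grows with the number of groups rather than with every pair of strings, which should help on large inputs. The trade-off is that the result depends on input order and on which string happens to start each group.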

I hope this helps.
