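Input:
val input = List(("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"), ("b", "2 cm"), ("b", "2.00 CM"))
Desired output:
val output = List(("a", "10 Inches", 0.66), ("b", "2 cm", 1.0))
I also have a utility function, fuzzyMatch, that returns true for fuzzy matches such as ("10 Inches", "10.00 inches"):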
fuzzyMatch(s1, s2) returns
true for s1 = "10 Inches" and s2 = "10.00 inches"
false for s1 = "10 Inches" and s2 = "15 in"
false for s1 = "10.00 inches" and s2 = "15 in"
true for s1 = "2 cm" and s2 = "2.00 CM"
Output = List of (unique_name, most frequent string value, max number of occurrences / total occurrences)
How can I reduce the above input to this output?
What I have so far:
val tupleMap = input.groupBy(identity).mapValues(_.size)
val totalOccurrences = input.groupBy(_._1).mapValues(_.size)
val maxNumberOfValueOccurrences = tupleMap.groupBy(_._1._1).mapValues(_.values.max)
val processedInput = tupleMap
  .filter {
    case (k, v) => v == maxNumberOfValueOccurrences(k._1)
  }
  .map {
    case (k, v) => (k._1, k._2, v.toDouble / totalOccurrences(k._1))
  }.toSeq
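This gives the ratio for exact matches only. How can I fit my fuzzy match in, so that all similar values are grouped together before the ratio is computed? The value reported for a fuzzy-matched group can be any member of that group.
It is essentially a custom groupBy that uses my fuzzyMatch(...) method, but I can't think of a solution here.
After some more thought, I came up with something like the following. Better solutions would be appreciated.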
import scala.collection.mutable

val tupleMap: Map[String, Seq[String]] = input.groupBy(_._1).mapValues(_.map(_._2))
val result = tupleMap mapValues {
  list =>
    val valueCountsMap: mutable.Map[String, Int] = mutable.Map[String, Int]()
    list foreach {
      value =>
        // Using fuzzy match to find the best match
        // findBestMatch (uses fuzzyMatch) returns the Option(key)
        // if there exists a similar key, if not returns None
        val bestMatch = findBestMatch(value, valueCountsMap.keySet.toSeq)
        if (bestMatch.isDefined) {
          val newValueCount = valueCountsMap.getOrElse(bestMatch.get, 0) + 1
          valueCountsMap(bestMatch.get) = newValueCount
        } else {
          valueCountsMap(value) = 1
        }
    }
    val maxOccurredValueNCount: (String, Int) = valueCountsMap.maxBy(_._2)
    (maxOccurredValueNCount._1, maxOccurredValueNCount._2)
}
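The snippet above calls findBestMatch without showing it. A minimal sketch of what such a helper might look like follows, together with an immutable version of the counting loop. Both the findBestMatch body and the stand-in fuzzyMatch are assumptions made for illustration, not the asker's actual implementations.

```scala
// Stand-in for the question's fuzzyMatch (an assumption): ignores case,
// a ".00" suffix on the number, and the spelling of the inch unit.
def fuzzyMatch(s1: String, s2: String): Boolean = {
  def norm(s: String): String =
    s.toLowerCase.replace(".00", "").replace("inches", "in")
  norm(s1) == norm(s2)
}

// Hypothetical helper: the first already-seen key that fuzzy-matches `value`.
def findBestMatch(value: String, keys: Seq[String]): Option[String] =
  keys.find(key => fuzzyMatch(value, key))

// Immutable variant of the counting loop for the values of a single key.
def countFuzzy(values: Seq[String]): Map[String, Int] =
  values.foldLeft(Map.empty[String, Int]) { (counts, value) =>
    val key = findBestMatch(value, counts.keys.toSeq).getOrElse(value)
    counts.updated(key, counts.getOrElse(key, 0) + 1)
  }

val counts = countFuzzy(Seq("10 Inches", "10.00 inches", "15 in"))
val (best, n) = counts.maxBy(_._2)
```

This greedy grouping uses the first value seen as the representative of each group, so the result can depend on input order if fuzzyMatch is not transitive.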
Answer 0 (score: 2)
If for some reason converting the values to numbers doesn't work for you, here is code that seems to do what you need:
def fuzzyMatch(s1: String, s2: String): Boolean = {
  // fake implementation
  val matches = List(("15 Inches", "15.00 inches"), ("2 cm", "2.00 CM"))
  s1.equals(s2) || matches.exists({
    case (m1, m2) => (m1.equals(s1) && m2.equals(s2)) || (m1.equals(s2) && m2.equals(s1))
  })
}
def test(): Unit = {
  val input = List(("a", "15 Inches"), ("a", "15.00 inches"), ("a", "10 in"), ("b", "2 cm"), ("b", "2.00 CM"))
  val byKey = input.groupBy(_._1).mapValues(l => l.map(_._2))
  val totalOccurrences = byKey.mapValues(_.size)
  val maxByKey = byKey.mapValues(_.head) // arbitrary "max" selection logic: the first value of each key
  val processedInput: List[(String, String, Double)] = maxByKey.map({
    case (mk, mv) =>
      val matchCount = byKey(mk).count(tv => fuzzyMatch(tv, mv))
      (mk, mv, matchCount.toDouble / totalOccurrences(mk))
  }).toList // toList instead of breakOut, which needs an import and was removed in Scala 2.13
  println(processedInput)
}
This prints:
List((b,2 cm,1.0), (a,15 Inches,0.6666666666666666))
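For comparison, the numeric-conversion route mentioned at the start of this answer could look roughly like the sketch below. The normalize helper and the unitAliases table are hypothetical names introduced for this example; the alias list is an assumption and would need to cover whatever units actually occur in the data.

```scala
// Assumed unit spellings, mapped to one canonical form each.
val unitAliases = Map("in" -> "in", "inch" -> "in", "inches" -> "in", "cm" -> "cm")

// Hypothetical normalizer: "10.00 inches" becomes Some((10.0, "in")).
def normalize(s: String): Option[(Double, String)] =
  s.trim.toLowerCase.split("\\s+") match {
    case Array(num, unit) =>
      for {
        value     <- scala.util.Try(num.toDouble).toOption
        canonical <- unitAliases.get(unit)
      } yield (value, canonical)
    case _ => None
  }

val input = List(("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"), ("b", "2 cm"), ("b", "2.00 CM"))

val output = input.groupBy(_._1).toList.sortBy(_._1).map { case (key, pairs) =>
  val values = pairs.map(_._2)
  // Group by normalized form; unparseable strings only group with identical strings.
  val groups = values.groupBy(v => normalize(v).toRight(v))
  val largest = groups.values.maxBy(_.size)
  (key, largest.head, largest.size.toDouble / values.size)
}
```

Because the grouping key is a plain value rather than a pairwise predicate, an ordinary groupBy is enough here and no custom matching loop is needed.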
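Answer 1 (score: 1)
Here is an approach that preprocesses input with fuzzy matching and then feeds the result into your existing code.
The idea is to first generate the 2-combinations of the input tuples, fuzzy-match them to build a Map of distinct Sets of matching values per key, and finally use that Map to fuzzy-match your original input.
To make sure more arbitrary cases are covered, I have expanded your input: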
val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)
// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}
// Create a Map of Sets of fuzzy-matched values from all 2-combinations per key
val fuzMap = input.combinations(2).foldLeft( Map[String, Seq[Set[String]]]() ){
  case (m, Seq(t1: Tuple2[String, String], t2: Tuple2[String, String])) =>
    // Only compare values that belong to the same key
    if (t1._1 == t2._1 && fuzzyMatch(t1._2, t2._2)) {
      val fuzSets = m.getOrElse(t1._1, Seq(Set(t1._2, t2._2))).map(
        x => if (x.contains(t1._2) || x.contains(t2._2)) x ++ Set(t1._2, t2._2) else x
      )
      if (!fuzSets.flatten.contains(t1._2) && !fuzSets.flatten.contains(t2._2))
        m + (t1._1 -> (fuzSets :+ Set(t1._2, t2._2)))
      else
        m + (t1._1 -> fuzSets)
    }
    else
      m
}
// fuzMap: scala.collection.immutable.Map[String,Seq[Set[String]]] = Map(
//   a -> List(Set(10 in, 10 inches), Set(15 in, 15 Inches, 15.00 inches)),
//   b -> List(Set(2 cm, 2.00 CM))
// )
Note that for a larger input, it may make sense to first groupBy key and then generate the 2-combinations within each key.
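A rough sketch of that per-key variant is shown here. The fuzzySets helper is a hypothetical name introduced for this example, and fuzzyMatch is the same trivialized stand-in as above; treat it as one possible shape rather than a tuned implementation.

```scala
val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  def norm(s: String): String =
    s.toLowerCase.replace(".00", "").replace("inches", "in")
  norm(s1) == norm(s2)
}

// Hypothetical helper: merge the values of ONE key into sets of matching strings.
def fuzzySets(values: Seq[String]): Seq[Set[String]] =
  values.distinct.combinations(2).foldLeft(Seq.empty[Set[String]]) {
    case (sets, Seq(v1, v2)) if fuzzyMatch(v1, v2) =>
      // Merge every existing set that already holds v1 or v2 with the new pair.
      val (hit, miss) = sets.partition(s => s(v1) || s(v2))
      miss :+ (hit.flatten.toSet ++ Set(v1, v2))
    case (sets, _) => sets
  }

// Group by key first, so pairs are only generated within a key.
val fuzMap: Map[String, Seq[Set[String]]] =
  input.groupBy(_._1)
    .map { case (k, pairs) => k -> fuzzySets(pairs.map(_._2)) }
    .filter(_._2.nonEmpty)
```

Grouping first shrinks the number of pairs from all 2-combinations of the whole input to the sum of the 2-combinations within each key, and it guarantees that values from different keys are never compared.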
The next step is to fuzzy-match the original input using the Map just created:
// Fuzzy-match original input using fuzMap
val fuzInput = input.map{ case (k, v) =>
  if (fuzMap.get(k).isDefined) {
    val fuzValues = fuzMap(k).map{
      case x => if (x.contains(v)) Some(x.min) else None
    }.flatten
    if (!fuzValues.isEmpty)
      (k, fuzValues.head)
    else
      (k, v)
  }
  else
    (k, v)
}
// fuzInput: List[(String, String)] = List(
//   (a,10 in), (a,15 Inches), (a,10 in), (a,15 Inches), (a,15 Inches),
//   (b,2 cm), (b,4 cm), (b,2 cm),
//   (c,7 cm), (c,7 in)
// )
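To close the loop, the preprocessed list can then go through an exact-match count like the one in the question. The sketch below hard-codes fuzInput from the output shown above so that it runs on its own; picking a single winner with maxBy is a simplification assumed here, whereas the question's filter keeps every value tied for the maximum.

```scala
val fuzInput = List(
  ("a", "10 in"), ("a", "15 Inches"), ("a", "10 in"), ("a", "15 Inches"), ("a", "15 Inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2 cm"),
  ("c", "7 cm"), ("c", "7 in")
)

// (key, most frequent value, occurrences of that value / total occurrences of the key)
val result: List[(String, String, Double)] =
  fuzInput.groupBy(_._1).toList.sortBy(_._1).map { case (key, pairs) =>
    val counts = pairs.groupBy(_._2).map { case (v, ps) => v -> ps.size }
    val (topValue, topCount) = counts.maxBy(_._2)
    (key, topValue, topCount.toDouble / pairs.size)
  }
```

For key a this yields "15 Inches" with ratio 0.6, and for key b "2 cm" with ratio 2/3. Key c has two values with one occurrence each, so the ratio is 0.5 and which of the two values is reported is not specified.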