Scala: merging tuples using fuzzy string matching

Asked: 2018-02-01 23:12:52

Tags: scala group-by fuzzy-comparison

Input:

val input = List(("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"), ("b", "2 cm"), ("b", "2.00 CM"))

The output I would like:

val output = List(("a", "10 Inches", 0.66), ("b", "2 cm", 1.0))

I also have a utility function, fuzzyMatch, that returns true for fuzzy matches such as ("10 Inches", "10.00 inches"):

fuzzyMatch(s1, s2) returns

true for s1 = "10 Inches" and s2 = "10.00 inches"
false for s1 = "10 Inches" and s2 = "15 in"
false for s1 = "10.00 inches" and s2 = "15 in"
true for s1 = "2 cm" and s2 = "2.00 CM"

Output = List of (unique_name, max occurred string value, (max number of occurrences/total occurrences))

How can I reduce the above input to that output?

What I have so far:

val tupleMap = input.groupBy(identity).mapValues(_.size)
val totalOccurrences = input.groupBy(_._1).mapValues(_.size)
val maxNumberOfValueOccurrences = tupleMap.groupBy(_._1._1).mapValues(_.values.max)
val processedInput = tupleMap
      .filter {
        case (k, v) => v == maxNumberOfValueOccurrences(k._1)
      }
      .map {
        case (k, v) => (k._1, k._2, v.toDouble / totalOccurrences(k._1))
      }.toSeq

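This gives the ratios for exact matches only. How can I fit in my fuzzy match so that all similar values are grouped together before the ratio is calculated? The value reported for a fuzzy-matched group can be any one of its members.

It is essentially a custom groupBy that uses my fuzzyMatch(...) method, but I can't think of a solution here.

After some more thought, I came up with something like the following. Better solutions would be appreciated.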

import scala.collection.mutable

val tupleMap: Map[String, Seq[String]] = input.groupBy(_._1).mapValues(_.map(_._2))

val result = tupleMap mapValues { list =>
  val valueCountsMap: mutable.Map[String, Int] = mutable.Map[String, Int]()

  list foreach { value =>
    // Use fuzzy matching to find the best match.
    // findBestMatch (which uses fuzzyMatch) returns Some(key) if a similar
    // key already exists, otherwise None.
    val bestMatch = findBestMatch(value, valueCountsMap.keySet.toSeq)
    if (bestMatch.isDefined) {
      val newValueCount = valueCountsMap.getOrElse(bestMatch.get, 0) + 1
      valueCountsMap(bestMatch.get) = newValueCount
    } else {
      valueCountsMap(value) = 1
    }
  }

  // Most frequent value for this key, together with its count
  val maxOccurredValueNCount: (String, Int) = valueCountsMap.maxBy(_._2)
  (maxOccurredValueNCount._1, maxOccurredValueNCount._2)
}
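The snippet above relies on findBestMatch, which is not shown, and it stops at the count without computing the ratio. A minimal sketch of how those pieces might fit together follows. Both the fuzzyMatch body (a trivial stand-in for the real utility) and the findBestMatch body (first existing key that fuzzy-matches) are assumptions, and fuzzyCounts is a hypothetical helper name.

```scala
// Stand-in for the real fuzzyMatch utility (assumption): lowercases,
// drops ".00", and treats "inches" as "in" before comparing.
def fuzzyMatch(s1: String, s2: String): Boolean = {
  def norm(s: String): String =
    s.toLowerCase.replace(".00", "").replace("inches", "in")
  norm(s1) == norm(s2)
}

// Hypothetical helper: the first already-seen key that fuzzy-matches `value`.
def findBestMatch(value: String, candidates: Seq[String]): Option[String] =
  candidates.find(fuzzyMatch(value, _))

// Count values per fuzzy group; the first-seen spelling represents the group.
def fuzzyCounts(values: Seq[String]): Map[String, Int] =
  values.foldLeft(Map.empty[String, Int]) { (counts, value) =>
    val key = findBestMatch(value, counts.keys.toSeq).getOrElse(value)
    counts.updated(key, counts.getOrElse(key, 0) + 1)
  }

val input = List(
  ("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"),
  ("b", "2 cm"), ("b", "2.00 CM")
)

// (name, most frequent value, max occurrences / total occurrences)
val result: List[(String, String, Double)] =
  input.groupBy(_._1).toList.sortBy(_._1).map { case (name, pairs) =>
    val (best, count) = fuzzyCounts(pairs.map(_._2)).maxBy(_._2)
    (name, best, count.toDouble / pairs.size)
  }
```

This is a greedy grouping: each value joins the first group whose representative it matches, so it only behaves like a true groupBy if fuzzyMatch is effectively transitive.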

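2 answers:

Answer 0 (score: 2)

If, for some reason, the approach of converting the values to numbers doesn't work for you, here is code that seems to do what you want: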

def fuzzyMatch(s1: String, s2: String): Boolean = {
  // fake implementation
  val matches = List(("15 Inches", "15.00 inches"), ("2 cm", "2.00 CM"))
  s1.equals(s2) || matches.exists({
    case (m1, m2) => (m1.equals(s1) && m2.equals(s2)) || (m1.equals(s2) && m2.equals(s1))
  })
}

def test(): Unit = {
  val input = List(("a", "15 Inches"), ("a", "15.00 inches"), ("a", "10 in"), ("b", "2 cm"), ("b", "2.00 CM"))
  val byKey = input.groupBy(_._1).mapValues(l => l.map(_._2))
  val totalOccurrences = byKey.mapValues(_.size)
  val maxByKey = byKey.mapValues(_.head) //random "max" selection logic

  // .toList replaces the original (breakOut), which needs
  // import scala.collection.breakOut and was removed in Scala 2.13
  val processedInput: List[(String, String, Double)] = maxByKey.map({
    case (mk, mv) =>
      val matchCount = byKey(mk).count(tv => fuzzyMatch(tv, mv))
      (mk, mv, matchCount / totalOccurrences(mk).toDouble)
  }).toList

  println(processedInput)
}

This prints:

List((b,2 cm,1.0), (a,15 Inches,0.6666666666666666))

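Answer 1 (score: 1)

Here is an approach that pre-processes input with fuzzy matching, so that the result can be fed into your existing code.

The idea is to first generate the 2-combinations of the input tuples, fuzzy-match each pair to build a Map from each key to the distinct Sets of matching values, and finally use that Map to fuzzy-match your original input.

To make sure more arbitrary cases are covered, I have extended your input: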

val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}

// Create a Map of Sets of fuzzy-matched values from all 2-combinations per key
val fuzMap = input.combinations(2).foldLeft( Map[String, Seq[Set[String]]]() ){
  case (m, Seq(t1: Tuple2[String, String], t2: Tuple2[String, String])) =>
    if (fuzzyMatch(t1._2, t2._2)) {
      val fuzSets = m.getOrElse(t1._1, Seq(Set(t1._2, t2._2))).map(
        x => if (x.contains(t1._2) || x.contains(t2._2)) x ++ Set(t1._2, t2._2) else x
      )
      if (!fuzSets.flatten.contains(t1._2) && !fuzSets.flatten.contains(t2._2))
        m + (t1._1 -> (fuzSets :+ Set(t1._2, t2._2)))
      else
        m + (t1._1 -> fuzSets)
    }
    else
      m
}
// fuzMap: scala.collection.immutable.Map[String,Seq[Set[String]]] = Map(
//   a -> List(Set(10 in, 10 inches), Set(15 in, 15 Inches, 15.00 inches)),
//   b -> List(Set(2 cm, 2.00 CM))
// )

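Note that for a larger input, it may make sense to groupBy the key first and generate the 2-combinations per key.

The next step is to fuzzy-match the original input using the Map just created: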
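A rough sketch of that per-key variant, assuming the same trivialized fuzzyMatch as above. The names addPair and fuzzySetsPerKey are hypothetical, and the merging step is simplified relative to the fold shown earlier: every existing set that touches a matching pair is joined with it.

```scala
// Trivialized fuzzy match, as in the answer
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}

// Hypothetical helper: merge a matching pair into the sets built so far,
// joining every existing set that already contains either value.
def addPair(sets: Seq[Set[String]], a: String, b: String): Seq[Set[String]] = {
  val (touching, rest) = sets.partition(s => s.contains(a) || s.contains(b))
  val merged: Set[String] = touching.flatten.toSet ++ Set(a, b)
  rest :+ merged
}

// Group by key first, then generate 2-combinations only within each key.
def fuzzySetsPerKey(input: Seq[(String, String)]): Map[String, Seq[Set[String]]] =
  input.groupBy(_._1).map { case (key, pairs) =>
    val values = pairs.map(_._2).distinct
    val sets = values.combinations(2).foldLeft(Seq.empty[Set[String]]) {
      case (acc, Seq(v1, v2)) if fuzzyMatch(v1, v2) => addPair(acc, v1, v2)
      case (acc, _)                                 => acc
    }
    key -> sets
  }.filter(_._2.nonEmpty) // keys without any fuzzy match are dropped

val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

val fuzMap = fuzzySetsPerKey(input)
```

Grouping first avoids comparing values that belong to different keys, so the number of pairs grows with the square of the largest group rather than of the whole input.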

// Fuzzy-match original input using fuzMap
val fuzInput = input.map{ case (k, v) => 
  if (fuzMap.get(k).isDefined) {
    val fuzValues = fuzMap(k).map{
      case x => if (x.contains(v)) Some(x.min) else None
    }.flatten
    if (!fuzValues.isEmpty)
      (k, fuzValues.head)
    else
      (k, v)
  }
  else
    (k, v)
}
// fuzInput: List[(String, String)] = List(
//   (a,10 in), (a,15 Inches), (a,10 in), (a,15 Inches), (a,15 Inches),
//   (b,2 cm), (b,4 cm), (b,2 cm),
//   (c,7 cm), (c,7 in)
// )
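To close the loop, the normalized fuzInput can go through an exact-match ratio calculation like the one in the question. The sketch below is one possible way to do that, not the asker's code verbatim: it copies fuzInput from the printed result above and keeps a single top value per key, whereas the question's filter-based version keeps every tied value.

```scala
// fuzInput copied from the printed result of the previous step
val fuzInput = List(
  ("a", "10 in"), ("a", "15 Inches"), ("a", "10 in"), ("a", "15 Inches"), ("a", "15 Inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2 cm"),
  ("c", "7 cm"), ("c", "7 in")
)

// (key, most frequent normalized value, max occurrences / total occurrences)
val ratios: List[(String, String, Double)] =
  fuzInput.groupBy(_._1).toList.sortBy(_._1).map { case (key, pairs) =>
    // Count each normalized value and keep the most frequent one
    val counts = pairs.groupBy(_._2).map { case (v, ps) => (v, ps.size) }
    val (topValue, topCount) = counts.maxBy(_._2)
    (key, topValue, topCount.toDouble / pairs.size)
  }
```

For key c the two values tie at one occurrence each, so which of them is reported depends on Map iteration order; only the ratio of 0.5 is stable.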