Question

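Suppose I am writing diff(s1: String, s2: String): List[String] to check whether s1 == s2 and return a list of errors:

if s1[i] != s2[i], the error is "s1[i] != s2[i]"
if i >= s2.length, the error for s1[i] is "s1[i] is undefined"
if i >= s1.length, the error for s2[i] is "s2[i] is missing"

For example: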

diff("a", "a")     // returns Nil
diff("abc", "abc") // Nil
diff("xyz", "abc") // List("x != a", "y != b", "z != c")
diff("abcd", "ab") // List("c is undefined", "d is undefined")
diff("ab", "abcd") // List("c is missing", "d is missing")
diff("", "ab")     // List("a is missing", "b is missing")  
diff("axy", "ab")  // List("x != b", "y is undefined")

How would you write it?

P.S. This is how I am currently writing diff:

def compare(pair: (Option[Char], Option[Char])) = pair match { 
  case (Some(x), None)    => Some(s"$x is undefined")
  case (None, Some(y))    => Some(s"$y is missing")
  case (Some(x), Some(y)) => if (x != y) Some(s"$x != $y") else None 
  case _ => None
}

def diff(s1: String, s2: String) = {
  val os1 = s1.map(Option.apply)
  val os2 = s2.map(Option.apply)
  os1.zipAll(os2, None, None).flatMap(compare)
}

Answer 1

More concise

First, here is how I would implement this method:

def diff(s1: String, s2: String): List[String] =
  (s1, s2).zipped.collect {
    case (x, y) if x != y => s"$x != $y"
  }.toList ++
    s1.drop(s2.length).map(x => s"$x is undefined") ++
    s2.drop(s1.length).map(y => s"$y is missing")

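That is roughly half the characters of the original implementation, and in my view it is at least as readable. You might argue that the drop trick is too clever, and you may be right, but I think it reads well once you get it.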
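
To see why the drop trick works, here is a minimal sketch (the object and val names are made up for illustration). Dropping more characters than a string contains yields an empty string, so at most one of the two tails contributes messages:

```scala
// Illustrative only: DropSketch and its val names are hypothetical.
object DropSketch {
  val longer  = "abcd"
  val shorter = "ab"

  // Characters of `longer` that have no counterpart in `shorter`.
  val extraInLonger: String = longer.drop(shorter.length)

  // Dropping past the end does not throw; it gives the empty string.
  val extraInShorter: String = shorter.drop(longer.length)

  // Mapping over the empty tail adds nothing, so only one kind of
  // "leftover" message can ever appear in the result.
  val messages: Seq[String] =
    extraInLonger.map(x => s"$x is undefined") ++
      extraInShorter.map(y => s"$y is missing")
}
```

Because of this, the concise diff needs no explicit comparison of the two lengths.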

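More efficient

A method like this is self-contained and easy to test, so if there is any chance it will be used where performance matters, an imperative implementation is worth considering. Here is a quick sketch of how I would do it: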

def diffFast(s1: String, s2: String): IndexedSeq[String] = {
  val builder = Vector.newBuilder[String]

  def diff(short: String, long: String, status: String) = {
    builder.sizeHint(long.length)
    var i = 0

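    while (i < short.length) {
      // Read from s1 and s2 directly (not short/long) so the message
      // always has the s1 character on the left, whichever is shorter.
      val x = s1.charAt(i)
      val y = s2.charAt(i)
      if (x != y) builder += s"$x != $y"
      i += 1
    }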

    while (i < long.length) {
      val x = long.charAt(i)
      builder += s"$x is $status"
      i += 1
    }
  }

  if (s1.length <= s2.length) diff(s1, s2, "missing")
    else diff(s2, s1, "undefined")

  builder.result()
}

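You might be able to make it a little faster by hinting at the size, etc. [Update: I went ahead and added this.] Still, this version is probably close to optimal, and I also find it quite readable: not as clear as either my short implementation above or your original, but nicer than the recursive implementation in the other answer.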

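Note that this returns an IndexedSeq, not a List. That follows your original implementation rather than the signature in your first sentence. If you need a List, you can change Vector.newBuilder to List.newBuilder (and the return type accordingly), but the Vector version is likely to be a little faster in most cases.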
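
As a sketch of that change, assuming nothing else needs to differ, the List-returning variant could look like the following. The names DiffFastListSketch and diffFastList are hypothetical, and this is not necessarily the exact code used in the benchmark below:

```scala
// Hypothetical List-returning variant of diffFast; only the builder
// and the return type differ from the Vector version above.
object DiffFastListSketch {
  def diffFastList(s1: String, s2: String): List[String] = {
    val builder = List.newBuilder[String]

    def diff(short: String, long: String, status: String): Unit = {
      builder.sizeHint(long.length)
      var i = 0

      // Compare the overlapping prefix; s1 is always on the left.
      while (i < short.length) {
        val x = s1.charAt(i)
        val y = s2.charAt(i)
        if (x != y) builder += s"$x != $y"
        i += 1
      }

      // Report whatever is left over in the longer string.
      while (i < long.length) {
        builder += s"${long.charAt(i)} is $status"
        i += 1
      }
    }

    if (s1.length <= s2.length) diff(s1, s2, "missing")
    else diff(s2, s1, "undefined")

    builder.result()
  }
}
```

For example, DiffFastListSketch.diffFastList("axy", "ab") gives List("x != b", "y is undefined"), matching the example in the question.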

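Benchmarks

We can speculate about performance all day, but it is easy enough to run some quick JMH microbenchmarks, so we might as well do that (full source here). I'll take the following pair of strings as a simple example: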

val example1: String = "a" * 1000
val example2: String = "ab" * 100

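We can measure throughput on this input for your original version (both as-is and returning a List), my concise version, my fast version (returning both IndexedSeq and List), and Tim's recursive version: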

Benchmark                 Mode  Cnt       Score     Error  Units
DiffBench.checkConcise   thrpt   20   47412.127 ± 550.693  ops/s
DiffBench.checkFast      thrpt   20  108661.093 ± 371.827  ops/s
DiffBench.checkFastList  thrpt   20   91745.269 ± 157.128  ops/s
DiffBench.checkOrig      thrpt   20    8129.848 ±  59.989  ops/s
DiffBench.checkOrigList  thrpt   20    7916.637 ±  15.736  ops/s
DiffBench.checkRec       thrpt   20   62409.682 ± 580.529  ops/s

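In short: your original implementation is really quite bad performance-wise (more because of all the allocations than the multiple traversals, I would guess). My concise implementation is competitive with the (arguably less readable) recursive one and gets about six times the throughput of the original, while the imperative implementation is nearly twice as fast as any of the others.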

Answer 2

[See the original answer below]

This can be done with a recursive algorithm:

def diff(a: String, b: String): List[String] = {
  @annotation.tailrec
  def loop(l: List[Char], r: List[Char], res: List[String]): List[String] =
    (l, r) match {
      case (Nil, Nil) =>
        res.reverse
      case (undef, Nil) =>
        res.reverse ++ undef.map(c => s"$c is undefined")
      case (Nil, miss) =>
        res.reverse ++ miss.map(c => s"$c is missing")
      case (lh :: lt, rh :: rt) if lh != rh =>
        loop(lt, rt, s"$lh != $rh" +: res)
      case (_ :: lt, _ :: rt) =>
        loop(lt, rt, res)
    }

  loop(a.toList, b.toList, Nil)
}

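Personally, I find this more obvious than using Option / zipAll / flatMap, but that is clearly a matter of personal preference and of the problem at hand. I think it is more flexible because, for example, it can easily be modified to generate a single error string for all the undefined or missing characters.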
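
As one possible sketch of that modification, the two "leftover" cases can emit a single message each. The name diffGrouped and the message format ("cd is undefined") are assumptions for illustration, not part of the question's specification:

```scala
// Hypothetical variant: one message for the whole leftover tail.
object DiffGroupedSketch {
  def diffGrouped(a: String, b: String): List[String] = {
    @annotation.tailrec
    def loop(l: List[Char], r: List[Char], res: List[String]): List[String] =
      (l, r) match {
        case (Nil, Nil) =>
          res.reverse
        case (undef, Nil) =>
          // All remaining characters of `a` go into a single message.
          (s"${undef.mkString} is undefined" :: res).reverse
        case (Nil, miss) =>
          // All remaining characters of `b` go into a single message.
          (s"${miss.mkString} is missing" :: res).reverse
        case (lh :: lt, rh :: rt) if lh != rh =>
          loop(lt, rt, s"$lh != $rh" :: res)
        case (_ :: lt, _ :: rt) =>
          loop(lt, rt, res)
      }

    loop(a.toList, b.toList, Nil)
  }
}
```

With this variant, diffGrouped("abcd", "ab") returns List("cd is undefined") instead of one entry per character; only the two tail cases changed.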

If efficiency matters, this version uses Iterator to avoid creating temporary lists, and nested if / else instead of match:

def diff(a: String, b: String): List[String] = {
  // .iterator works on Scala 2.12 and 2.13; toIterator is deprecated in 2.13.
  val l = a.iterator
  val r = b.iterator

  @annotation.tailrec
  def loop(res: List[String]): List[String] =
    if (l.isEmpty) {
      res.reverse ++ r.map(c => s"$c is missing")
    } else {
      if (r.isEmpty) {
        res.reverse ++ l.map(c => s"$c is undefined")
      } else {
        val lhead = l.next()
        val rhead = r.next()

        if (lhead == rhead) {
          loop(res)
        } else {
          loop(s"$lhead != $rhead" +: res)
        }
      }
    }

  loop(Nil)
}

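Thanks to Brian McCutchon for pointing out the problem with using String rather than List[Char], and to Andrey Tyukin for encouraging me to post the more efficient solution.

Original answer

A recursive implementation isn't too horrible: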

def diff(a: String, b: String): List[String] = {
  @annotation.tailrec
  def loop(l: String, r: String, res: List[String]) : List[String] = (l, r) match {
    case ("", "") =>
      res
    case (lrem, "") =>
      res ++ lrem.map(c => s"$c is undefined")
    case ("", rrem) =>
      res ++ rrem.map(c => s"$c is missing")
    case _ if l.head != r.head =>
      loop(l.tail, r.tail, res :+ s"${l.head} != ${r.head}")
    case _ =>
      loop(l.tail, r.tail, res)
  }

  loop(a, b, Nil)
}

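This should perform OK unless there are a lot of errors, in which case appending to res becomes expensive. You can fix this by prepending to res and then reversing at the end if necessary, but that makes the code less clear.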
