Scalding: compare strings pairwise?

Date: 2014-07-15 12:22:18

Tags: scala hadoop edit-distance scalding

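With Scalding, I need to:

  1. Group a string field by its first 3 characters
  2. Compare all pairs of strings within each group using the edit-distance metric (http://en.wikipedia.org/wiki/Edit_distance)
  3. Write the results to a CSV file whose records are string; string; distance

To group the strings, I use map and groupBy, as in the following example: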

    import cascading.tuple.Fields
    import com.twitter.scalding._
    
    class Scan(args: Args) extends Job(args) {
      val output = TextLine("tmp/out.txt")
    
      val wordsList = List(
        ("aaaa"),
        ("aaabb"),
        ("aabbcc"),
        ("aaabccdd"),
        ("aaabbccdde"),
        ("aaabbddd"),
        ("bbbb"),
        ("bbbaaa"),
        ("bbaaabb"),
        ("bbbcccc"),
        ("bbbddde"),
        ("ccccc"),
        ("cccaaa"),
        ("ccccaabbb"),
        ("ccbbbddd"),
        ("cdddeee")
        )
    
      val orderedPipe =
        IterableSource[(String)](wordsList, ('word))
          .map('word -> 'key) { word: String => word.take(3) }
          .groupBy('key) { _.toList[String]('word -> 'x) }
          .debug
          .write(output)
    }
    

As a result I get:

    ['aaa', 'List(aaabbddd, aaabbccdde, aaabccdd, aaabb, aaaa)']
    ['aab', 'List(aabbcc)']
    ['bba', 'List(bbaaabb)']
    ['bbb', 'List(bbbddde, bbbcccc, bbbaaa, bbbb)']
    ['ccb', 'List(ccbbbddd)']
    ['ccc', 'List(ccccaabbb, cccaaa, ccccc)']
    ['cdd', 'List(cdddeee)']
    

Now, in this example, I need to compute the edit distances between the strings listed under the 'aaa' key:

    List(aaabbddd, aaabbccdde, aaabccdd, aaabb, aaaa)
    

Then the same for all strings listed under the 'bbb' key:

    List(bbbddde, bbbcccc, bbbaaa, bbbb)
    

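To compute the edit distance between all strings in each group, I need to replace toList with my own function. How can I do that? And how can I write the results of my function to a CSV file?

Thanks!

Update

How do I get a List out of a Scalding Pipe?

toList just returns another Pipe, so I cannot call List methods such as combinations on the result: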

      val orderedPipe =
        IterableSource[(String)](wordsList, ('word))
            .map('word -> 'key ){word:String => word.take(3)}
            .groupBy('key) {_.toList[String]('word -> 'x) }
            .combinations(2) //---ERROR! Pipe has no such method!
            .debug
            .write(output)
    

1 answer:

Answer 0 (score: 1)

The edit distance can be computed as described on Wikipedia:

    def editDistance(a: String, b: String): Int = {
        import scala.math.min

        def min3(x: Int, y: Int, z: Int) = min(min(x, y), z)

        val (m, n) = (a.length, b.length)
        // matrix(i)(j) holds the distance between the first i chars of a and the first j chars of b
        val matrix = Array.fill(m + 1, n + 1)(0)

        for (i <- 0 to m; j <- 0 to n) {
            matrix(i)(j) = if (i == 0) j
                           else if (j == 0) i
                           else if (a(i - 1) == b(j - 1)) matrix(i - 1)(j - 1)
                           else min3(
                                     matrix(i - 1)(j) + 1,      // deletion
                                     matrix(i)(j - 1) + 1,      // insertion
                                     matrix(i - 1)(j - 1) + 1)  // substitution
        }

        matrix(m)(n)
    }

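To find the pairwise edit distances of the elements of a list:

    def editDistances(list: List[String]): List[(String, String, Int)] = {
        // combinations(2) yields every unordered pair once; a single-element list yields no pairs
        list.combinations(2).toList.map(x => (x(0), x(1), editDistance(x(0), x(1))))
    }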
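
The two functions can be checked on plain Scala collections before wiring them into a job. The following is only a sketch under that assumption: the object name `PairwiseSketch` and the helper `pairwiseRows` are made up for illustration, and the grouping is done with the standard library's `groupBy` rather than with Scalding.

```scala
object PairwiseSketch {
  // Same dynamic-programming recurrence as the editDistance function in this answer.
  def editDistance(a: String, b: String): Int = {
    val (m, n) = (a.length, b.length)
    val d = Array.ofDim[Int](m + 1, n + 1)
    for (i <- 0 to m; j <- 0 to n) {
      d(i)(j) =
        if (i == 0) j
        else if (j == 0) i
        else if (a(i - 1) == b(j - 1)) d(i - 1)(j - 1)
        else 1 + math.min(math.min(d(i - 1)(j), d(i)(j - 1)), d(i - 1)(j - 1))
    }
    d(m)(n)
  }

  // Group by the 3-character prefix, then build one "string;string;distance"
  // row for every unordered pair inside each group (keys sorted for stable output).
  def pairwiseRows(words: List[String]): List[String] =
    words.groupBy(_.take(3)).toList.sortBy(_._1).flatMap { case (_, group) =>
      group.combinations(2).map(p => s"${p(0)};${p(1)};${editDistance(p(0), p(1))}")
    }

  def main(args: Array[String]): Unit =
    pairwiseRows(List("bbbb", "bbbaaa", "bbbcccc", "aabbcc")).foreach(println)
}
```

With this sample input the 'aab' group has a single member and therefore contributes no rows, while the 'bbb' group contributes three.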

Use editDistances inside groupBy:

    val orderedPipe =
      IterableSource[(String)](wordsList, ('word))
        .map('word -> 'key) { word: String => word.take(3) }
        .groupBy('key) {
          _.mapList[String, List[(String, String, Int)]]('word -> 'x)(editDistances)
        }
        .debug
        .write(output)

As for writing the output in CSV format, you can simply use the com.twitter.scalding.Csv class:

    write(Csv(outputFile))
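
Note that mapList leaves one tuple per key, with the whole list of triples in a single field, so writing that pipe directly would not give one string; string; distance record per line. A possible way to flatten it is sketched below. It assumes the fields-based API's flatMapTo and the separator parameter of Csv are available in the Scalding version in use; the names `PairwiseDistances`, `Distances`, the fields 'pairs, 'first, 'second, 'distance and the `output` argument are hypothetical and chosen only for illustration.

```scala
import com.twitter.scalding._

// Hypothetical helper object, so the closures below do not capture the job itself.
object Distances {
  def editDistance(a: String, b: String): Int = {
    val (m, n) = (a.length, b.length)
    val d = Array.ofDim[Int](m + 1, n + 1)
    for (i <- 0 to m; j <- 0 to n) {
      d(i)(j) =
        if (i == 0) j
        else if (j == 0) i
        else if (a(i - 1) == b(j - 1)) d(i - 1)(j - 1)
        else 1 + math.min(math.min(d(i - 1)(j), d(i)(j - 1)), d(i - 1)(j - 1))
    }
    d(m)(n)
  }

  def editDistances(words: List[String]): List[(String, String, Int)] =
    words.combinations(2).toList.map(p => (p(0), p(1), editDistance(p(0), p(1))))
}

// Hypothetical job; expects an --output argument with the target path.
class PairwiseDistances(args: Args) extends Job(args) {
  val wordsList = List("aaaa", "aaabb", "bbbb", "bbbaaa") // shortened sample data

  IterableSource[String](wordsList, 'word)
    .map('word -> 'key) { word: String => word.take(3) }
    .groupBy('key) {
      _.mapList[String, List[(String, String, Int)]]('word -> 'pairs)(Distances.editDistances)
    }
    // Emit one tuple per pair; flatMapTo keeps only the three new fields.
    .flatMapTo('pairs -> ('first, 'second, 'distance)) {
      pairs: List[(String, String, Int)] => pairs
    }
    // Semicolon-separated records instead of the default comma.
    .write(Csv(args("output"), separator = ";"))
}
```

Collecting a whole group into a List keeps it in memory on one reducer, which should be acceptable for small prefix groups like the ones in the question but may need rethinking for very large groups.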