Generating a diff of List[String] in Scalding

Time: 2016-03-29 22:32:05

Tags: java scala scalding

In my Scalding job I have records: TypedPipe[(String, util.List[String])], where the first value is an id and the second is a list of things. Imagine:

("1", ["a","b","c"])
("1", ["a","b","c"])
("1", ["a","b","c"])
("2", ["a","b"])
("2", ["a","b","c"])
("3", ["a","b","c"])

After records.groupBy(_._1), I want to output only those records for a given id that differ from one another. For the input above, the output should be:

("2", ["a","b"])
("2", ["a","b","c"])

I am new to Scalding. What is an elegant way to achieve this?

2 answers:

Answer 0: (score: 0)

I don't know whether the Scalding aspect is essential for you (is your collection very large?), but in plain Scala I would do it like this:

// Given:
val records = Seq(
  "1" -> List("a", "b", "c"),
  "1" -> List("a", "b", "c"),
  "1" -> List("a", "b", "c"),
  "2" -> List("a", "b"),
  "2" -> List("a", "b", "c"),
  "3" -> List("a", "b", "c"),
  "3" -> List("d")
)

val distinctValues = records.groupBy(_._1).map { case (k, v) => k -> v.toSet }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 1 -> Set((1,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

// Keep only the ids that have more than one distinct record (filter, not map)
val havingMultipleDistinct = distinctValues.filter { case (_, v) => v.size > 1 }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

val asRecords = havingMultipleDistinct.values.flatten
// => List((2,List(a, b)), (2,List(a, b, c)), (3,List(a, b, c)), (3,List(d)))
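
The same steps can be folded into a single function. The following is only a sketch in plain Scala, assuming the values are java.util.List[String] as in the question; the names DiffRecords and diffRecords are hypothetical and not part of Scalding. The result is sorted so that the output order is predictable.

```scala
import java.util.{List => JList}

object DiffRecords {
  // Hypothetical helper: keeps only the records whose id maps to
  // more than one distinct list, with duplicates collapsed.
  def diffRecords(records: Seq[(String, JList[String])]): Seq[(String, JList[String])] =
    records
      .groupBy(_._1)
      .toSeq
      .flatMap { case (id, group) =>
        // java.util.List compares by content, so distinct works here
        val distinctLists = group.map(_._2).distinct
        if (distinctLists.size > 1) distinctLists.map(list => (id, list))
        else Seq.empty
      }
      // Sort by id, then by list length, only to make the order stable
      .sortBy { case (id, list) => (id, list.size) }
}
```

With the input from the question, this should keep only the two records for id "2".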

Answer 1: (score: 0)

If the values for each key are small enough to fit in memory, this should do it:

records
  .group
  .toSet
  .filter { case (_, values) => values.size > 1 }
  .flatten
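
Collapsing the values into a set relies on the lists comparing by content rather than by reference. The question's values are java.util.List[String], and the java.util.List contract defines equals and hashCode element-wise, so this should hold. A minimal plain-Scala sketch of that assumption (JavaListEquality and distinctCount are hypothetical names used only for illustration):

```scala
import java.util.{List => JList}

object JavaListEquality {
  // Counts distinct lists; relies on java.util.List's
  // content-based equals and hashCode.
  def distinctCount(lists: Seq[JList[String]]): Int =
    lists.toSet.size
}
```

Two lists with the same elements in the same order count as one, even when they are different List implementations.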

If they are too large, you can join the pipe with itself:

val grouped = records.group
grouped
 .join(grouped)
 .collect { case(k, (a, b)) if a != b => k -> a }
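
To see what the self-join produces, it can be modeled with plain Scala collections. This is a sketch of the logic only, not Scalding API; SelfJoinSketch and differingRecords are hypothetical names. It pairs every value with every other value under the same id and keeps the left side when the two differ. A value that differs from several others shows up once per differing partner, so the sketch assumes a final deduplication step is wanted.

```scala
object SelfJoinSketch {
  // Models an inner self-join on the id, then keeps values that
  // differ from at least one other value under the same id.
  def differingRecords[V](records: Seq[(String, V)]): Seq[(String, V)] = {
    val byId = records.groupBy(_._1)
    val joined = for {
      (id, a) <- records
      (_, b)  <- byId(id)
      if a != b
    } yield id -> a
    // Without distinct, a value is emitted once per differing partner
    joined.distinct
  }
}
```

For the question's input this should yield the two records for id "2"; an id with three mutually different lists would yield each of them once.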