How to get distinct tuples in Scala

Asked: 2016-02-24 15:50:49

Tags: scala apache-spark

I have an RDD and I want to create a new RDD that contains only unique values, but I get an error. Code:

val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/archivosOrigen").map( _.split(",", -1) match {
  case Array(caso, canal, lote, estado, estadoo, estadooo, fechacreacioncaso, fechacierrecaso, username, clientid, nombre, apellido, ani, email) =>(canal, username, ani, email)
}).distinct

val twtface = rdd.map {
  case (  canal, username, ani, email ) =>
    val campoAni = "ANI"
    (campoAni , ani , canal , username)
}.distinct()

twtface.take(3).foreach(println)

This is the CSV file:

caso2,canal2,lote,estado3,estado4,estado5,fechacreacioncaso2,fechacierrecaso2,username,clientid,nombre,apellido,ani,email
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,

Error:

scala.MatchError: [Ljava.lang.String;@dea08cc (of class [Ljava.lang.String;)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

1 answer:

Answer 0 (score: 0)

I think the error is caused by a missing or extra line break in the CSV file.

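Your split and match assume that every line of the CSV has exactly 14 fields. Depending on the encoding or text editor you use, an extra newline may have been added at the end of the document. A blank line splits into a single empty field, which matches no case in your pattern and therefore throws the MatchError.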
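To check this assumption, here is a minimal sketch in plain Scala (no Spark needed) that counts the fields `split(",", -1)` produces for each line. The object name `FieldCounts` and the helper `fieldCount` are hypothetical names made up for illustration; the sample lines are one row from the question's CSV followed by a blank line.

```scala
// Hypothetical diagnostic, not part of the original post.
object FieldCounts {
  // Same split as in the question: limit -1 keeps trailing empty fields.
  def fieldCount(line: String): Int = line.split(",", -1).length

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,",
      "" // a blank line, e.g. an extra newline at the end of the file
    )
    // Report the field count per line so short or long lines stand out.
    lines.zipWithIndex.foreach { case (line, idx) =>
      println(s"line ${idx + 1}: ${fieldCount(line)} fields")
    }
  }
}
```

The data row should report 14 fields and the blank line 1, so only the blank line would fall outside the 14-element `Array(...)` pattern.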

My suggestion is to validate each line and add a catch-all case that gives you a more detailed error message, so you avoid the opaque MatchError.
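One way this suggestion could look is sketched below in plain Scala. `ParseCsvLine`, `Row`, and `parseLine` are hypothetical names introduced for illustration, and returning an `Either` is just one possible design: a rejected line becomes a `Left` with a descriptive message instead of throwing.

```scala
// Hypothetical helper, not from the original post.
object ParseCsvLine {
  type Row = (String, String, String, String) // (canal, username, ani, email)

  // Parse one CSV line into the four fields the question keeps,
  // or explain why the line was rejected.
  def parseLine(line: String): Either[String, Row] =
    line.split(",", -1) match {
      // Exactly 14 fields: keep the 2nd, 9th, 13th and 14th.
      case Array(_, canal, _, _, _, _, _, _, username, _, _, _, ani, email) =>
        Right((canal, username, ani, email))
      // Catch-all: any other field count, such as a blank trailing line.
      case other =>
        Left(s"expected 14 fields but got ${other.length}: '$line'")
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com",
      "2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com",
      "2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,",
      ""
    )
    val parsed = lines.map(parseLine)
    // Print the rejected lines with their reason.
    parsed.collect { case Left(msg) => msg }.foreach(println)
    // Keep the valid rows and deduplicate them.
    parsed.collect { case Right(row) => row }.distinct.foreach(println)
  }
}
```

In the Spark job from the question, the same function could presumably be applied with `sc.textFile(path).map(parseLine)`, inspecting the `Left` values first and then calling `distinct` on the `Right` ones.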