我有一个RDD,我想创建一个具有唯一值的新RDD,但我有一个错误 代码:
val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/archivosOrigen").map( _.split(",", -1) match {
case Array(caso, canal, lote, estado, estadoo, estadooo, fechacreacioncaso, fechacierrecaso, username, clientid, nombre, apellido, ani, email) =>(canal, username, ani, email)
}).distinct
val twtface = rdd.map {
case ( canal, username, ani, email ) =>
val campoAni = "ANI"
(campoAni , ani , canal , username)
}.distinct()
twtface.take(3).foreach(println)
这是CSV文件
caso2,canal2,lote,estado3,estado4,estado5,fechacreacioncaso2,fechacierrecaso2,username,clientid,nombre,apellido,ani,email
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
错误:
scala.MatchError: [Ljava.lang.String;@dea08cc (of class [Ljava.lang.String;)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
答案 0 :(得分:0)
我认为错误是由于csv文件中缺少/附加换行符所致。
您的分割和匹配假定csv的每一行都有14个字段。根据您使用的编码或文本编辑器,您可能在文档的末尾添加了其他新行。
我的建议是验证每一行并添加一个包含所有内容的案例,以便为您提供更详细的错误消息,这样您就可以避免模糊的MatchError。