mapPartitionsWithIndex gives the header in a different position

Date: 2015-05-21 18:03:02

Tags: scala csv apache-spark

I have some CSV files, for example:

NU_NOTIF,CoordX_UTMSAD69,CoordY_UTMSAD69,TP_NOT,ID_AGRAVO,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,CS_RACA,CS_ESCOL_N,ID_CNS_SUS,NDUPLIC_N,DT_DIGITA,DT_TRANSUS,DT_TRANSDM,DT_TRANSSM,DT_TRANSRM,DT_TRANSRS,DT_TRANSSE,NU_LOTE_V,NU_LOTE_H,CS_FLXRET,FLXRECEBI,IDENT_MICR,MIGRADO_W,DT_INVEST,ID_OCUPA_N,DT_SORO,RESUL_SORO,DT_NS1,RESUL_NS1,DT_VIRAL,RESUL_VI_N,DT_PCR,RESUL_PCR_,SOROTIPO,HISTOPA_N,IMUNOH_N,DOENCA_TRA,EPISTAXE,GENGIVO,METRO,PETEQUIAS,HEMATURA,SANGRAM,LACO_N,PLASMATICO,EVIDENCIA,PLAQ_MENOR,TP_SISTEMA,Long_WGS84,Lat_WGS84
2332769,"677873,18","7468220,51",2,A90,29/01/2010 00:00:00,201004,2010,33,2273225,11/01/2010 00:00:00,201002,9,03, , ,26/02/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,29/01/2010 00:00:00, ,18/01/2010 00:00:00,1,, ,,4,,4, ,4,4,2, , , , , , , , , ,0.000000000000000,1,"-43.266430481500002","-22.884869715699999"
2273294,"676608,79","7467659,4",2,A90,22/01/2010 00:00:00,201003,2010,33,2708167,21/01/2010 00:00:00,201003,9,09, , ,04/02/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,, ,, ,, ,, ,, , , , , , , , , , , , , , ,0.000000000000000,1,"-43.278688469099997","-22.890070246099999"
2446032,"669591,392118294","7467756,59464924",2,A90,15/01/2010 00:00:00,201002,2010,33,2296608,09/01/2010 00:00:00,201001,9,09, , ,15/01/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,15/01/2010 00:00:00, ,,4,, ,,4,,4, ,4,4,9, , , , , , , , , ,0.000000000000000,1,"-43.347090180499997","-22.889919181600000"

To parse this while skipping the first line (I don't know why it's there, but there's nothing I can do about it), I did:

val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")

val rdd = csv.mapPartitionsWithIndex(
    (i, iterator) => if (i == 0 && iterator.hasNext) {
      iterator.next
      iterator
    } else iterator)

I checked whether the RDD was OK with rdd.foreach(x => println(x.toString + "\n")). Unfortunately, it gives a random row as the first line instead of the header (which I assume should be the first line, right?).

So the result looks like this:

2258026,"685693,42","7458369,42",2,A90,27/01/2010 00:00:00,201004,2010,33,3005992,25/01/2010 00:00:00,201004,9,09, , ,27/04/2010 00:00:00,,,07/12/2010 00:00:00,,,,2010049, , , , , ,, ,, ,, ,, ,, , , , , , , , , , , , , , ,0.000000000000000,1,"-43.189041385899998","-22.972965925200000"

NU_NOTIF,CoordX_UTMSAD69,CoordY_UTMSAD69,TP_NOT,ID_AGRAVO,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,CS_RACA,CS_ESCOL_N,ID_CNS_SUS,NDUPLIC_N,DT_DIGITA,DT_TRANSUS,DT_TRANSDM,DT_TRANSSM,DT_TRANSRM,DT_TRANSRS,DT_TRANSSE,NU_LOTE_V,NU_LOTE_H,CS_FLXRET,FLXRECEBI,IDENT_MICR,MIGRADO_W,DT_INVEST,ID_OCUPA_N,DT_SORO,RESUL_SORO,DT_NS1,RESUL_NS1,DT_VIRAL,RESUL_VI_N,DT_PCR,RESUL_PCR_,SOROTIPO,HISTOPA_N,IMUNOH_N,DOENCA_TRA,EPISTAXE,GENGIVO,METRO,PETEQUIAS,HEMATURA,SANGRAM,LACO_N,PLASMATICO,EVIDENCIA,PLAQ_MENOR,TP_SISTEMA,Long_WGS84,Lat_WGS84

2258019,"686278,41","7459234,58",2,A90,18/01/2010 00:00:00,201003,2010,33,3005992,16/01/2010 00:00:00,201002,9,09, , ,22/01/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,, ,, ,, ,, ,, , , , , , , , , , , , , , ,0.000000000000000,1,"-43.183441365699998","-22.965089100099998"

2332769,"677873,18","7468220,51",2,A90,29/01/2010 00:00:00,201004,2010,33,2273225,11/01/2010 00:00:00,201002,9,03, , ,26/02/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,29/01/2010 00:00:00, ,18/01/2010 00:00:00,1,, ,,4,,4, ,4,4,2, , , , , , , , , ,0.000000000000000,1,"-43.

Does anyone know how to get the header as the first line? Also, is there a way to grab just some of the CSV's columns using mapPartitionsWithIndex?

Edit 1

As @user3712791 pointed out, it was missing a `true` after `} else iterator`, so now it runs fine.

val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")

val rdd = csv.mapPartitionsWithIndex(
    ((i, iterator) => if (i == 0 && iterator.hasNext) {
      iterator.next
      iterator
    } else iterator), true)

@Paul, I misunderstood what mapPartitionsWithIndex does. I thought it worked like a key-value pairing of the header and the data (the rows below the header).

I believe I now have to do a groupBy to achieve this, or is there a better idea?

(I have to do this because I only need 5 columns of the data.)
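For example, would something like this rough sketch be the right direction (my own guess, assuming the fields contain no embedded commas, which isn't quite true here, and with hypothetical column indices)?

    // Hypothetical: split each data line and keep 5 columns by position.
    // Note this naive split(",") breaks on quoted fields like "677873,18".
    val fiveCols = rdd.map(_.split(","))
      .map(a => (a(0), a(3), a(4), a(5), a(7)))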

2 answers:

Answer 0 (score: 5):

If the preservesPartitioning parameter is not set, the ordering of mapPartitionsWithIndex over the RDD is not fixed.
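For reference, the signature in the Spark 1.x RDD API is roughly:

    def mapPartitionsWithIndex[U: ClassTag](
        f: (Int, Iterator[T]) => Iterator[U],
        preservesPartitioning: Boolean = false): RDD[U]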

So add a `, true` after the function and it should work... something like:

val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")

val rdd = csv.mapPartitionsWithIndex(
    ((i, iterator) => if (i == 0 && iterator.hasNext) {
      iterator.next
      iterator
    } else iterator), true)

Regarding the new information: I think it's really a whole new question, but... First, you are taking a text file and putting it into an RDD, so it will be an RDD of strings, and you probably can't use a groupBy on that...

First, let's make a schema:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val schemaArray = csv.collect()(0).split(",")
val schema =
  StructType(
    schemaArray.map(fieldName => StructField(fieldName, StringType, true)))

Now that we have a schema... the rest is just like the tutorial, but I'll write it out here:

// Note: a plain split(",") will break the quoted fields that contain commas
// (e.g. "677873,18"); a proper CSV parser would be safer for this data.
val rowRDD = rdd.map(_.split(",")).map(p => Row.fromSeq(p))

// Apply the schema to the RDD.
val schemaRDD = sqlContext.applySchema(rowRDD, schema)

// Register the SchemaRDD as a table.
schemaRDD.registerTempTable("casos")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT Column1,Column2,Column3,Column4,Column5 FROM casos")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Column1: " + t(0)).collect().foreach(println)
//You may also want to refine the new object somehow
results.registerTempTable("tempTable")
val results2 = sqlContext.sql("SELECT Column1,Column2,Column3,Column4,Column5 FROM casos WHERE Column1=1")
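
For illustration, with the actual header of your file, selecting five columns by name might look like this (the choice of columns here is just a hypothetical example):

    // Hypothetical pick of 5 columns from the header shown in the question
    val selected = sqlContext.sql(
      "SELECT NU_NOTIF, DT_NOTIFIC, ID_AGRAVO, Long_WGS84, Lat_WGS84 FROM casos")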

I may have forgotten some imports, so I'll leave a link to the tutorial: https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#rdds
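For instance, the snippets above assume a sqlContext, which in Spark 1.x can be created from the SparkContext like this:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)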

Answer 1 (score: 0):

For your specific task, I'd suggest looking at the spark-csv package on http://spark-packages.org. If you do end up having to do the parsing by hand, you may just want to call first() to grab the first line.

Another option is to do something similar to how CsvRelation in spark-csv works: grab the first line, then filter that line out wherever it appears.
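A minimal sketch of that approach, assuming sc is the usual SparkContext:

    val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")
    val header = csv.first()                      // grab the first line (the header)
    val data = csv.filter(line => line != header) // drop the header wherever it appears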
