I have some csv files, for example:
ç
NU_NOTIF,CoordX_UTMSAD69,CoordY_UTMSAD69,TP_NOT,ID_AGRAVO,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,CS_RACA,CS_ESCOL_N,ID_CNS_SUS,NDUPLIC_N,DT_DIGITA,DT_TRANSUS,DT_TRANSDM,DT_TRANSSM,DT_TRANSRM,DT_TRANSRS,DT_TRANSSE,NU_LOTE_V,NU_LOTE_H,CS_FLXRET,FLXRECEBI,IDENT_MICR,MIGRADO_W,DT_INVEST,ID_OCUPA_N,DT_SORO,RESUL_SORO,DT_NS1,RESUL_NS1,DT_VIRAL,RESUL_VI_N,DT_PCR,RESUL_PCR_,SOROTIPO,HISTOPA_N,IMUNOH_N,DOENCA_TRA,EPISTAXE,GENGIVO,METRO,PETEQUIAS,HEMATURA,SANGRAM,LACO_N,PLASMATICO,EVIDENCIA,PLAQ_MENOR,TP_SISTEMA,Long_WGS84,Lat_WGS84
2332769,"677873,18","7468220,51",2,A90,29/01/2010 00:00:00,201004,2010,33,2273225,11/01/2010 00:00:00,201002,9,03, , ,26/02/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,29/01/2010 00:00:00, ,18/01/2010 00:00:00,1,, ,,4,,4, ,4,4,2, , , , , , , , , ,0.000000000000000,1,"-43.266430481500002","-22.884869715699999"
2273294,"676608,79","7467659,4",2,A90,22/01/2010 00:00:00,201003,2010,33,2708167,21/01/2010 00:00:00,201003,9,09, , ,04/02/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,, ,, ,, ,, ,, , , , , , , , , , , , , , ,0.000000000000000,1,"-43.278688469099997","-22.890070246099999"
2446032,"669591,392118294","7467756,59464924",2,A90,15/01/2010 00:00:00,201002,2010,33,2296608,09/01/2010 00:00:00,201001,9,09, , ,15/01/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,15/01/2010 00:00:00, ,,4,, ,,4,,4, ,4,4,9, , , , , , , , , ,0.000000000000000,1,"-43.347090180499997","-22.889919181600000"
To parse this while skipping the first line (I don't know why it's there, but there's nothing I can do about it), I did:
val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")
val rdd = csv.mapPartitionsWithIndex(
  (i, iterator) => if (i == 0 && iterator.hasNext) {
    iterator.next
    iterator
  } else iterator)
I used rdd.foreach(x => println(x.toString + "\n")) to check whether the rdd was OK. Unfortunately, a random line comes out first instead of the header (which I assume should be the first line, right?).
So the result looks like this:
2258026,"685693,42","7458369,42",2,A90,27/01/2010 00:00:00,201004,2010,33,3005992,25/01/2010 00:00:00,201004,9,09, , ,27/04/2010 00:00:00,,,07/12/2010 00:00:00,,,,2010049, , , , , ,, ,, ,, ,, ,, , , , , , , , , , , , , , ,0.000000000000000,1,"-43.189041385899998","-22.972965925200000"
NU_NOTIF,CoordX_UTMSAD69,CoordY_UTMSAD69,TP_NOT,ID_AGRAVO,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,CS_RACA,CS_ESCOL_N,ID_CNS_SUS,NDUPLIC_N,DT_DIGITA,DT_TRANSUS,DT_TRANSDM,DT_TRANSSM,DT_TRANSRM,DT_TRANSRS,DT_TRANSSE,NU_LOTE_V,NU_LOTE_H,CS_FLXRET,FLXRECEBI,IDENT_MICR,MIGRADO_W,DT_INVEST,ID_OCUPA_N,DT_SORO,RESUL_SORO,DT_NS1,RESUL_NS1,DT_VIRAL,RESUL_VI_N,DT_PCR,RESUL_PCR_,SOROTIPO,HISTOPA_N,IMUNOH_N,DOENCA_TRA,EPISTAXE,GENGIVO,METRO,PETEQUIAS,HEMATURA,SANGRAM,LACO_N,PLASMATICO,EVIDENCIA,PLAQ_MENOR,TP_SISTEMA,Long_WGS84,Lat_WGS84
2258019,"686278,41","7459234,58",2,A90,18/01/2010 00:00:00,201003,2010,33,3005992,16/01/2010 00:00:00,201002,9,09, , ,22/01/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,, ,, ,, ,, ,, , , , , , , , , , , , , , ,0.000000000000000,1,"-43.183441365699998","-22.965089100099998"
2332769,"677873,18","7468220,51",2,A90,29/01/2010 00:00:00,201004,2010,33,2273225,11/01/2010 00:00:00,201002,9,03, , ,26/02/2010 00:00:00,,,16/11/2010 00:00:00,,,,2010041, , , , , ,29/01/2010 00:00:00, ,18/01/2010 00:00:00,1,, ,,4,,4, ,4,4,2, , , , , , , , , ,0.000000000000000,1,"-43.
Does anyone know how to get the header onto the first line? Also, is there a way to use mapPartitionsWithIndex to pull out just some of the csv's columns?
Edit 1
As @user3712791 said, it was missing `true` after `} else iterator)`, so now it runs smoothly.
val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")
val rdd = csv.mapPartitionsWithIndex(
  ((i, iterator) => if (i == 0 && iterator.hasNext) {
    iterator.next
    iterator
  } else iterator), true)
@Paul, I misunderstood what mapPartitionsWithIndex does. I thought it gave something like a key-value pairing of the header and the data (the rows under the header).
I believe I'd now have to do a groupBy to pull that off, or is there a better idea?
(I have to do this because I only need 5 columns from the data.)
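For illustration, here is a rough sketch of the kind of extraction I mean, assuming the header has already been isolated. The five column positions are just an example, and the regex split is there because a plain split(",") would misalign quoted fields that contain commas, like "677873,18":

val header = rdd.first()                      // the header row
val fiveCols = rdd.filter(_ != header).map { line =>
  // split on commas that sit outside double quotes
  val f = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
  (f(0), f(4), f(5), f(7), f(8))              // NU_NOTIF, ID_AGRAVO, DT_NOTIFIC, NU_ANO, SG_UF_NOT
}
fiveCols.take(3).foreach(println)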
Answer 0 (score: 5)
If the preservesPartitioning parameter is not set, the ordering coming out of mapPartitionsWithIndex on the RDD is not fixed.
So add `, true` after the function and it should work... like:
val csv = sc.textFile("./project/Casos_Notificados_Dengue_01_2010.csv")
val rdd = csv.mapPartitionsWithIndex(
  ((i, iterator) => if (i == 0 && iterator.hasNext) {
    iterator.next
    iterator
  } else iterator), true)
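One way to verify this on the driver: take() pulls the leading elements back in partition order, which avoids the misleading interleaving you can get from foreach, whose println runs on the executors:

// the header should now come back as the first element
rdd.take(3).foreach(println)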
As for the new information: I think this is really a whole new question by now, but... First, you're reading a text file into an RDD, so it will be an RDD of strings, which means you probably can't use groupBy on it directly...
So, let's build a schema first:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Build a schema of string fields from the header row.
// (rdd, with the stray first line dropped, starts at the header;
// csv.collect()(0) would pick up the stray line instead)
val schemaArray = rdd.first().split(",")
val schema = StructType(
  schemaArray.map(fieldName => StructField(fieldName, StringType, true)))
Now that we have a schema... the rest is just like the tutorial, but I'll write it out here:
// Split each line into fields and convert it to a Row.
// (Note: rdd still contains the header line itself; filter it out first
// if it shouldn't show up as a data row.)
val rowRDD = rdd.map(_.split(",")).map(p => Row.fromSeq(p))
// Apply the schema to the RDD.
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
// Register the SchemaRDD as a table.
schemaRDD.registerTempTable("casos")
// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT Column1,Column2,Column3,Column4,Column5 FROM casos")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Column1: " + t(0)).collect().foreach(println)
//You may also want to refine the new object somehow
results.registerTempTable("tempTable")
val results2 = sqlContext.sql("SELECT Column1,Column2,Column3,Column4,Column5 FROM tempTable WHERE Column1=1")
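Note that Column1 ... Column5 are placeholders; with this file the column names come from the header row, so the query would look more like the following (which five columns you pick is up to you, these are just examples from the header):

val results = sqlContext.sql(
  "SELECT NU_NOTIF, DT_NOTIFIC, SEM_NOT, Long_WGS84, Lat_WGS84 FROM casos")
results.map(t => "NU_NOTIF: " + t(0)).collect().foreach(println)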
I may have forgotten some imports, so I'll leave a link to the tutorial: https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#rdds
Answer 1 (score: 0)
So for your specific task, I'd suggest checking out the spark-csv package on http://spark-packages.org. If you do end up having to do the parsing by hand, you probably just want to call first() to get the first row.
The other option is to do something similar to how CsvRelation in spark-csv works: it grabs the first line and then filters that line out wherever it appears.
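A minimal sketch of that filter approach, assuming first() returns the line you want to drop (with this particular file you would apply it twice, once for the stray first line and once for the header):

val firstLine = csv.first()                            // line to discard
val withoutFirst = csv.filter(line => line != firstLine)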