Remove the first row from a Spark Dataset created by reading a CSV

Asked: 2018-08-29 05:07:57

Tags: java scala apache-spark pyspark apache-spark-dataset

I have a CSV file (employee.csv) that looks like this:

20180011,20180011123,007,07
Employee_ID,Name,Country
1,Maddy,IND
2,Sun,US

I am currently reading the CSV file with Spark as follows:

Dataset<Row> dataset = spark.read().format("csv")
                        .option("header", "false")
                        .load("./employee.csv");

Now I need to get rid of the first line of the CSV file, 20180011,20180011123,007,07, and load the rest into a Dataset that uses the CSV header:

Employee_ID,Name,Country
1,Maddy,IND
2,Sun,US

Can anyone help me with this?

4 answers:

Answer 0 (score: 1)

This code first filters out the junk row, then extracts the header line, and finally converts the remaining rows to a DataFrame.

    val ss = SparkSession.builder().appName("local").master("local[*]").getOrCreate()

    val path = "C:\\Users\\user1\\data.txt"

    val data = ss.sparkContext.textFile(path)
    val junk = data.first()

    val fdata = data.filter(x => x != junk) // removes the first line

    val header = fdata.filter(x => x.split(",")(1) == "Name").collect().mkString // filtering the header line.

    import ss.implicits._
    val df = fdata
        .filter(x => x.split(",")(1) != "Name") // filtering all except header line
        .map(x => x.split(","))
        .map(t => (t(0), t(1), t(2))) //splitting data to tuples
        .toDF(header.split(","): _*) // applying header string as column names
    df.show()

Output:

+-----------+-----+-------+
|Employee_ID| Name|Country|
+-----------+-----+-------+
|          1|Maddy|    IND|
|          2|  Sun|     US|
+-----------+-----+-------+

Answer 1 (score: 0)

data = sc.textFile('employee.csv')

header = data.first()  # the unwanted first line

data = data.filter(lambda row: row != header)

You can start processing the data right away. Hopefully you can find a way to implement this approach in Java as well.
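For reference, here is a minimal Java sketch of the same RDD idea. This is an untested illustration that assumes an existing `SparkSession` named `spark`, as in the question:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

    JavaRDD<String> data = jsc.textFile("employee.csv");
    String first = data.first();                          // the unwanted first line
    JavaRDD<String> cleaned = data.filter(row -> !row.equals(first));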

Answer 2 (score: 0)

You can try this approach:

val csvData = spark.read.csv("./abc.csv") // read the csv
val firstRow = csvData.head
val filteredFirstRowData = csvData.filter(x => x != firstRow) // remove the unwanted first row
val lastColumnDropData = filteredFirstRowData.drop(filteredFirstRowData.col("_c3")) // drop the unwanted last column
val headers = lastColumnDropData.head
val filteredHeaderData = lastColumnDropData.filter(x => x != headers) // remove the header row from the body
val seqHeader = headers.toSeq.asInstanceOf[Seq[String]] // header values are all strings
val finalDF = filteredHeaderData.toDF(seqHeader: _*) // apply the header values as column names
finalDF.show

Output:

+-----------+-----+-------+
|Employee_ID| Name|Country|
+-----------+-----+-------+
|          1|Maddy|    IND|
|          2|  Sun|     US|
+-----------+-----+-------+

Answer 3 (score: 0)

Here is a solution with the column names hard-coded:

val file = sc.textFile("/FileStore/tables/sample.csv")

import spark.implicits._ // needed for toDF on an RDD (pre-imported in spark-shell)

val dfFile = file.map(line => line.split(",")).
          filter(lines => lines.length == 3 && lines(0) != "Employee_ID"). // keep only 3-field rows, drop the header
          map(row => (row(0), row(1), row(2))).
          toDF("Employee_ID", "Name", "Country")
dfFile.show

Some useful links:

https://stackoverflow.com/a/37780783/7130689

Spark SQL - loading csv/psv files with some malformed records
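Both links cover Spark's handling of malformed CSV records. As a hedged sketch of that idea in Java (assuming Spark 2.x CSV behavior and that Employee_ID is numeric), supplying an explicit three-column schema with mode set to DROPMALFORMED should discard both the four-field junk line and the non-numeric header line in a single pass:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical sketch: a fixed three-column schema for employee.csv.
    StructType schema = new StructType()
            .add("Employee_ID", DataTypes.IntegerType)
            .add("Name", DataTypes.StringType)
            .add("Country", DataTypes.StringType);

    Dataset<Row> dataset = spark.read().format("csv")
            .schema(schema)
            .option("mode", "DROPMALFORMED") // drop rows that do not fit the schema
            .load("./employee.csv");
    // The 4-field junk line and the textual header line both fail the schema,
    // so only the two data rows should remain.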