I have a CSV file (employee.csv) that looks like this:
20180011,20180011123,007,07
Employee_ID,Name,Country
1,Maddy,IND
2,Sun,US
Now I am reading the CSV file with Spark, as follows:
Dataset<Row> dataset = spark.read().format("csv")
.option("header", "false")
.load("./employee.csv");
Now I need to get rid of the first line of the CSV file, 20180011,20180011123,007,07, and load the remaining rows into a Dataset using the CSV header:
Employee_ID,Name,Country
1,Maddy,IND
2,Sun,US
Can someone help me?
Answer 0 (score: 1)
This code first filters out the junk row, then extracts the header line, and finally converts the remaining rows to a DataFrame.
import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
val path = "C:\\Users\\user1\\data.txt"
val data = ss.sparkContext.textFile(path)

val junk = data.first()
val fdata = data.filter(x => x != junk) // remove the first (junk) line

val header = fdata.filter(x => x.split(",")(1) == "Name").collect().mkString // extract the header line

import ss.implicits._
val df = fdata
  .filter(x => x.split(",")(1) != "Name") // keep every line except the header
  .map(x => x.split(","))
  .map(t => (t(0), t(1), t(2))) // turn each line into a tuple
  .toDF(header.split(","): _*) // use the header fields as column names
df.show()
Output:
+-----------+-----+-------+
|Employee_ID| Name|Country|
+-----------+-----+-------+
| 1|Maddy| IND|
| 2| Sun| US|
+-----------+-----+-------+
Answer 1 (score: 0)
val data = sc.textFile("employee.csv")
val header = data.first()
val rows = data.filter(row => row != header) // drop the first line
You can start processing the data right away. Hopefully you can also find a way to implement this approach in Java.
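For reference, here is a minimal Java sketch of the same idea, extended to also promote the second line to column names (the class name is illustrative; it assumes the junk line and the real header are the first two lines of the file, and reads every column as a string):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class EmployeeCsvLoader { // illustrative class name
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("employee").master("local[*]").getOrCreate();

        // Read the file as plain lines, as in the snippet above
        JavaRDD<String> lines = spark.read().textFile("./employee.csv").javaRDD();

        // Drop the junk first line, then peel off the real header line
        String junk = lines.first();
        JavaRDD<String> withoutJunk = lines.filter(l -> !l.equals(junk));
        String headerLine = withoutJunk.first();
        JavaRDD<String> body = withoutJunk.filter(l -> !l.equals(headerLine));

        // Build a schema from the header fields (all columns as strings)
        StructField[] fields = Arrays.stream(headerLine.split(","))
                .map(name -> DataTypes.createStructField(name, DataTypes.StringType, true))
                .toArray(StructField[]::new);
        StructType schema = DataTypes.createStructType(fields);

        // Split each data line into a Row and assemble the Dataset
        JavaRDD<Row> rows = body.map(l -> RowFactory.create((Object[]) l.split(",")));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
    }
}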
Answer 2 (score: 0)
You can try this approach:
val csvData = spark.read.csv("./abc.csv") // read the csv
val firstRow = csvData.head
val filteredFirstRowData = csvData.filter(x => x != firstRow) // remove the unwanted first line
val lastColumnDropData = filteredFirstRowData.drop(filteredFirstRowData.col("_c3")) // remove the unwanted last column
val headers = lastColumnDropData.head
val filteredHeaderData = lastColumnDropData.filter(x => x != headers) // remove the header line from the body
val seqHeader = headers.toSeq.asInstanceOf[Seq[String]] // safe cast: with no schema given, every column is a string
val finalDF = filteredHeaderData.toDF(seqHeader: _*) // rebuild the schema with the new column names
finalDF.show
Output:
+-----------+-----+-------+
|Employee_ID| Name|Country|
+-----------+-----+-------+
| 1|Maddy| IND|
| 2| Sun| US|
+-----------+-----+-------+
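A rough Java translation of these same steps, for the Java API used in the question (an untested sketch; it assumes an existing SparkSession named spark, and that Spark infers four columns _c0 to _c3 from the first line, as in the Scala version):

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// inside a method, given an existing SparkSession `spark`
Dataset<Row> csvData = spark.read().csv("./employee.csv");

// the 4-column junk line makes Spark infer columns _c0.._c3;
// the real data rows simply get null for _c3
Row junkRow = csvData.head();
Dataset<Row> noJunk = csvData.filter((FilterFunction<Row>) r -> !r.equals(junkRow));
Dataset<Row> noExtraCol = noJunk.drop("_c3");

// the next remaining row is the real header: remove it from the body
// and use its values as the new column names
Row headerRow = noExtraCol.head();
Dataset<Row> body = noExtraCol.filter((FilterFunction<Row>) r -> !r.equals(headerRow));

Dataset<Row> finalDF = body;
for (int i = 0; i < headerRow.size(); i++) {
    finalDF = finalDF.withColumnRenamed("_c" + i, headerRow.getString(i));
}
finalDF.show();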
Answer 3 (score: 0)
Here is a static solution (the column names are hard-coded):
val file = sc.textFile("/FileStore/tables/sample.csv")
val dfFile = file
  .map(line => line.split(","))
  .filter(cols => cols.length == 3 && cols(0) != "Employee_ID") // drops the 4-column junk line and the header line
  .map(cols => (cols(0), cols(1), cols(2)))
  .toDF("Employee_ID", "Name", "Country")
dfFile.show
Some useful links:
https://stackoverflow.com/a/37780783/7130689
Spark SQL - loading csv/psv files with some malformed records