I want to read a file that contains the following data:
"Name","Surname","Age","Birthdate","Address","PhoneNumber"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"
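When I read the file with spark.read.option("delimiter", ",").csv(filename),
the Address column is read correctly even though it contains ',', which is also the delimiter.
The problem with this approach is that for rows with more or fewer columns than expected, the read function silently truncates the extra fields or pads the missing ones (they appear as null) in the resulting DataFrame. That is not the output I need.
The output I want contains only rows with the required number of delimiters, which is 5 in this case. Records with more or fewer delimiters must be rejected.
So the good records are: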
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
And the bad records are:
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"
Reading the file as described above does not let me identify the bad records.
How can I do this?
Answer 0 (score: 0)
Looking at your data
"Name","Surname","Age","Birthdate","Address","PhoneNumber"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"
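it appears to have a header line that can be used for the column names of the DataFrame. You can use the header option as follows
// Read the CSV, using the first line as the column names.
// .csv(...) already selects the built-in CSV source (Spark 2.0+),
// so the extra .format("com.databricks.spark.csv") call is not needed.
spark.read
  .option("header", true)
  .csv("path to your csv file")
  .show(false)  // false = do not truncate long cell values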
This should give you the following output DataFrame
+-------+-------+---+----------+---------------------------+-----------+
|Name   |Surname|Age|Birthdate |Address                    |PhoneNumber|
+-------+-------+---+----------+---------------------------+-----------+
|Chaitra|Shenoy |21 |1995-08-26|A-123,Spring blossom Area  |null       |
|Sapna  |Soni   |22 |1994-04-16|B-56,Ganga Park,Ghorpadi   |9022       |
|Tanvi  |Mutha  |48 |1969-03-24|A-23,Valencia,Mundhwa      |1256       |
|Shivani|Adsar  |55 |1961-11-09|Saptami-234,Udita,Salt Lake|5485       |
|Chaitra|Shenoy |21 |1995-08-26|A-123,Spring blossom Area  |5555       |
|Sapna  |Soni   |22 |1994-04-16|B-56,Ganga Park,Ghorpadi   |null       |
+-------+-------+---+----------+---------------------------+-----------+
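Note that in this output the short rows are padded with null and the extra "Yes" field is dropped, so the good and bad records are still mixed together. One possible way to separate them is to count the fields of each raw line before parsing. The following is only a minimal sketch in plain Scala, not part of the original answer: the names CsvFieldCheck, countFields, splitRecords and sample are hypothetical, and it assumes every record sits on a single line and that quotes inside a field are escaped by doubling them.

```scala
object CsvFieldCheck {

  /** Counts comma-separated fields, ignoring commas inside double quotes. */
  def countFields(line: String): Int = {
    var inQuotes = false
    var fields = 1
    for (c <- line) {
      if (c == '"') inQuotes = !inQuotes            // a doubled "" toggles twice, so state is preserved
      else if (c == ',' && !inQuotes) fields += 1   // only unquoted commas separate fields
    }
    fields
  }

  /** Splits lines into (good, bad) by comparing the field count with `expected`. */
  def splitRecords(lines: Seq[String], expected: Int): (Seq[String], Seq[String]) =
    lines.partition(line => countFields(line) == expected)

  // The six data rows from the question (header line left out).
  val sample: Seq[String] =
    """|"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area"
       |"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi","9022"
       |"Tanvi","Mutha","48","1969-03-24","A-23,Valencia,Mundhwa","1256","Yes"
       |"Shivani","Adsar","55","1961-11-09","Saptami-234,Udita,Salt Lake","5485"
       |"Chaitra","Shenoy","21","1995-08-26","A-123,Spring blossom Area","5555"
       |"Sapna","Soni","22","1994-04-16","B-56,Ganga Park,Ghorpadi"
       """.stripMargin.trim.split("\n").map(_.trim).toSeq

  def main(args: Array[String]): Unit = {
    val expected = 6 // six header columns, i.e. five delimiters
    val (good, bad) = splitRecords(sample, expected)
    good.foreach(line => println(s"GOOD: $line"))
    bad.foreach(line => println(s"BAD:  $line"))
  }
}
```

In Spark, the same predicate could presumably be applied to the raw lines, for example by reading them with spark.read.textFile(path) and filtering on CsvFieldCheck.countFields(line) == 6, and only then parsing the surviving lines as CSV. Counting on raw lines would not work for quoted fields that span several lines.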
I hope this answer helps.