I am running into a problem when reading the test2.csv file in PySpark.
Test file test1.csv:
a1^b1^c1^d1^e1
a2^"this is having
multiline data1
multiline data2"^c2^d2^e2
a3^b3^c3^d3^e3
a4^b4^c4^d4^e4
Test file test2.csv:
a1^b1^c1^d1^e1
a2^this is having
multiline data1
multiline data2^c2^d2^e2
a3^b3^c3^d3^e3
a4^b4^c4^d4^e4
Below is the code:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("A", StringType()),
    StructField("B", StringType()),
    StructField("C", StringType()),
    StructField("D", StringType()),
    StructField("E", StringType())
])
Creating DataFrames for the two CSV files above:
df1 = spark.read.csv("s3_path/test1.csv", schema=schema, multiLine=True, sep='^')
df1.show(10, False)
print('df1.count() is:', df1.count())
Below is the output when I read the test1.csv file
+---+-----------------------------------------------+---+---+---+
|A |B |C |D |E |
+---+-----------------------------------------------+---+---+---+
|a1 |b1 |c1 |d1 |e1 |
|a2 |this is having
multiline data1
multiline data2|c2 |d2 |e2 |
|a3 |b3 |c3 |d3 |e3 |
|a4 |b4 |c4 |d4 |e4 |
+---+-----------------------------------------------+---+---+---+
df1.count() is: 4
df2 = spark.read.csv("s3_path/test2.csv", schema=schema, multiLine=True, sep='^')
df2.show(10, False)
print('df2.count() is:', df2.count())
Below is the output when I read the test2.csv file
+---------------+---------------+----+----+----+
|A |B |C |D |E |
+---------------+---------------+----+----+----+
|a1 |b1 |c1 |d1 |e1 |
|a2 |this is having |null|null|null|
|multiline data1|null |null|null|null|
|multiline data2|c2 |d2 |e2 |null|
|a3 |b3 |c3 |d3 |e3 |
|a4 |b4 |c4 |d4 |e4 |
+---------------+---------------+----+----+----+
df2.count() is: 6
Source files:
Looking at the difference between the two source files: in test1.csv the multiline data is wrapped in double quotes ("), while in test2.csv it is not.
Problem description: column B of the second row contains multiline data. The output of df2 has 6 records because Spark reads each physical line as a new record, which is not correct.
The output of df1 has 4 records, and the multiline data in column B of the second row is treated as a single string, which is correct.
Question: can someone help me fix the code so that test2.csv is read correctly?
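One workaround I am considering is to pre-merge the physical lines back into logical records before parsing. Below is a minimal plain-Python sketch (not a Spark-specific fix), assuming every record has exactly 5 fields and that '^' never occurs inside the data itself:

```python
def merge_records(lines, num_fields=5):
    """Join continuation lines until a full record (num_fields fields,
    i.e. num_fields - 1 '^' separators) has been accumulated."""
    buffer = ""
    for line in lines:
        # Continuation lines are rejoined with the newline they were split on.
        buffer = line if not buffer else buffer + "\n" + line
        if buffer.count("^") >= num_fields - 1:
            yield buffer.split("^")
            buffer = ""
    if buffer:  # emit any trailing partial record
        yield buffer.split("^")

# Same content as test2.csv (no quotes around the multiline field).
text = """a1^b1^c1^d1^e1
a2^this is having
multiline data1
multiline data2^c2^d2^e2
a3^b3^c3^d3^e3
a4^b4^c4^d4^e4"""

rows = list(merge_records(text.splitlines()))
for r in rows:
    print(r)
```

In Spark, the same merging could presumably be applied to the output of spark.read.text before splitting into columns (and the result passed to spark.createDataFrame with the schema above), but the cleanest fix may be upstream: quote the multiline fields, which is what multiLine=True relies on.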