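We are trying to read a 3 GB file using spark-csv with the univocity 1.5.0 parser. One of the file's columns contains multiple newline characters, and in some rows the record is being split across several columns at those newlines. This only happens with large files.
We are using Spark 1.6.1 and Scala 2.10.
The following code is used to read the file: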
sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "FAILFAST")
  .option("escape", "\"")
  .option("quote", "\"")
  .option("parserLib", "univocity")
  .load("abc.csv")
The read fails with an error of the form: java.lang.Exception: FAILFAST at 01/20/2015.
Sample file:
" A AAAAAAAA"" AA999"" AA999"" AA999"" 9999-99-99-99.99.99.999999 "," AAAAAA99"," Aaaaa Aaaaaaaa
99/99/9999 - AAA Aaaaaaa Aa:aaaaaaaa aa aaaaa,aaaaaaaa aaa aaaaaaa aaaaaaaaaa
Aaa aaaaa aa AAA aaa aaaaaaaaaaa
99/99/9999 AAAAA AAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAA AAAAA AAA AA AAAAAA AAAAAA AAAAAAAAAA AAAAAAA AA AAAAAAAAA。
99/99/9999 Aaa&a; aaaaaa a / aaa aaaaaaa - AAA aaaaaaaa aaa&a 39a aaaaaaa
99/99/9999 AAA aaaaaa - aaaaaaa aaaaaaaaa
99/99/9999 AAA aaaaaa。 Aaa aaaa Aa。 AAAAAA了Aa:AAAAAAAAA AAAAAAAA AAAAAA,A AAAAAAA AAAA AAAAAAAAAA,AAAAA AAAAAAA AAAA AAAAAAAAAA(AAAA AAAAAAAAAAAA AAAAAAA)。 A和了Aa AAAAAA AA AAAAAAAAAA AAA AAAA AAAAAA AAAA AAAAA AA AAAAAAAAA,A AAAAAAAA AAAAA AAA AAAAA AAAAAAAA AAAAA AAAA AAAAA AA AAAAAAAAA。 Aaa aaaaaa aaaaaa aaaaaa aaaa aaaaaa。
99/99/9999 - aaaaa aaaaaaaa。
99/99/9999 - AAA
99/99/9999 AAA aaaaaa aaaaa aa Aaa 9999 aaaa aaaaaaaaa aaaaaaaaaa - aa A& Aa。 Aaaaaaaaa aaaaa aaaaaa。
99/99/9999 AAA aaaaa aaaaaa - aa aaaaaa aa aaaaa aaaaaa aa AAA aa AAA aaa aa aaaaaa aaaaaa aaaa-aaaaaaaaaa。 Aa aaaaaaaa aa aaaaaa A& Aa aaaaa aa aaaaa aaaaaaa。
99/99/9999 - Aaaaaa aaaaaa aaaa。 Aaaaaaa aaaa aaaa 99/99/9999 - 99/99/9999
99/99/9999 - AAAAAA AAAAAAA AA AA AAAA:AAAA AAAAA AAAA AAAAAA AAAA AAAA AAAAA AA AAA AAAAAAAAA
99/99/9999 Aaaaaa a / aaa aaaaaaa。 Aaaa aaaaaaaa aa aaaaaaaaaaa aa AA。
99/99/9999 Aaaaaa aaaaaa aaaaaa aaaa。
99/99/9999 Aaaaaaaa aaaaaa aa aaaaaa aaaa
99/99/9999 Aaaaa a / aaa aaaaaaa aaa&a 39a aaaaaaaaa aaaaaaaaaa aaaaaaa
99/99/9999 AAA aaaaaa A& Aa aaaaaa aaa aaaaaaaaaaaaa aaa aaaaa aaaaaa
99/99/9999 AAA aaaaa aaaaaa - aaaaa aaaaaaaaaaaaaaa aaa aaaaaaaaaaa aa aaaaaaaaaa AAA AAAAAA AAAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAA AAA AAAA AAAAAA AA AAAAAA AAAAAA AAAAA AAAA AA AAAAAA AAA AAAAAAAA AAAAAAAAA A和AA。AAA AAAAAAAAA,AAAAAAAAA AAAAA AAAAAAAAA
99/99/9999 AAA AAAAAA AAAAAAA AAAA AAAAAA AA Aaa级9. A和AA。AAAAAA AA AAAAA AAAAA AAAA AAAAAAAA,AAAAAAAAAA AAAA AAAAAAAA AAA AAAA AAAAA AAAAAAAA AAAAAA
99/99/9999 AAA - aaaaaaaaaa aaaaaaaaaa。
AAA AAAAAAAAA AAAAAAAAAA AAAAAAA AAAA AAAAAAAAAAAA AAAAA AA AAAA AAAAAA AA AA AAAAAAA AAAAA AAAAAAAAA AAAAA AA AA AAAAAAAAAAA AAAA
99/99/9999 AAA AAAAA AAAAAA - AAAAAAAAAAAA AAAAAA AA 99/99/9999 AAAAAA AAAA AAAA AAAAA AAA AAAAAAAAAA A / AAAAAAAAA AAAAAAAAA AAAAAAAA。 Aaa级AAAA AAAAAAAAAAAA 99/99/9999 AA AAA AAA AAAAAAAAAAA aaaaaaaaaaaaa AAAAA 99/99/9999 AAAA AAA AA AAAAAAA AAAAAAAAA AAAAAAAA,AAAAAA AAA AA AAAAA AAAAAAAAA AA AA 99机管局AAA AAAAAAA AA AAAAAAAAA AAAAAAAA,AAA AAAAAAAAAA AAAAAAAA AAAAA AAAA AAAAAAAAAAA AAAA AAAA aaaaaaaaa aaaaaaaaaa aaaaaaaaaa。
99/99/9999 AAA AAAA AAAAA - AAA AAAAAAA AAAA A和AA。AAAAAAAAAA AA AAA AAAAAAAAAAAA AAAAA AAAA AAAA AAAAAAA AA AAAAA 9999
Aaaaaaaa aaaaaa aa aaaa aa aa aaa 9,9999 aaa aaaaaaa aaaaaaa aaaaa aaa aaaaaaaa aaaa Aa。 Aaaaaaaa aaa aaaa aaaaaa aa aaaaaaa aaaaaa aa A& Aa aaa aaaaaaaa aaaaaa aaaa aaaa。 AAAA AA AAAAAAA AAA AA AAAAA AAAA AAAAAAAAAA AAAAAAAAAA AAA AA AA AAAAA AAAAA AAAAAAAAAA AA AAAAAAAAAAAA。
99/99/9999 Aaaaaa aaa&a 39a aaaa AA
99/99/9999 - a / a aaaa aa aaaaaaaaaaaa
99/99/9999 Aaaaaa aaa&a 39a aaaa aaaaaaaaaaaa
99/99/9999 - aaaa aaaaaa aa aaaaaaaaaaa aaaaaaaa aaa aaa aaaaaaaaa 99/99/9999 - AAA AA AAAA AAAAAA AAAAAAAAAAAA AAA AAAAAAAAAAAA AAAA AAAA AAAAA AAAA AAAA AAA AAA 99,9999 AAAAA AAA AAA AAAAAAAAAA
99/99/9999 - aaaa aaa&aaaa aaaaaaaaaaaaaa aaaaaaaa aa aaaa aaaa aaaaaaa aaaaaaaaa 99/99/9999 - aaaa aaaaaa aa aaaaaaaaaaa aa:a / a aaaa aa aa aaaa。 Aaaaaaaaa aaaaaaa aaa aaaaaa aaaa aaa aaaaaaaaaaa aaa aaa aaaaaaa aaa aaa aaaaaaaa aaaaa aaa aaaa aaa aaaa aaaa aaaaa aaaaaaaa aaa aaaa aaa aaa aaa aaa aaa aaaa aaaa aaaa aaaa aaaa aaaa Aaaa aaa aa aaaaa a / aaaaa aaaaa。 Aaa aaaaaa aa aaaa aaaaa aaaaa。
99/99/9999 - Aaaaa AAA aaaaaa aaaaaaaa。 AAAAAAAAA AAAA AAAA AAAAA AAAAAAA AAAAAAAA AAA AAAAA AAAAAAAAAA AAAAAAAA,AAAAAAA,AAAAA AA的AAA AA AAAA AAAA AAAAAAA AA AAAAAAAA AA AAAAAAA,AAAA AAAAA,AAAAAA AAA,AAAA AA AAAAAAAA,AAAA AA AAAAAAAAAA,AAAAAAA AAAAA AAAAAA。 AAAAA AAA AAAAA AAAA AAAAAAA AAAAAAAA AA AAA AAAAAAAAAA AA AAAAAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAA(AAAAA AA AA AAAAAAAAAA AAAA级9999)。 Aaaa aa aaaaa aa aaaa aa aaaaaa aa aaaa。 AAA AAAAAAAA AAA AAAAAAAAAA AA的AAAAAAAAAA AA AAAAAAAA AAAAAAAA,AAAAAA AAAAA AA AAA AAAAAA AAAAA AAAAAAAAAAA AAA AAAAAAAA AA AAA AAAAA AAAAAAAA AA AAA 9999 AA AAA AAAAAAA AA AAAAAAA AA AAAAAAA AAAAAAAA。 Aa Aaa 9999,Aa。 AAAAAAAA AAAAA AAAAAAAAAA AAA AAAAAAAA AAAAAAAA,AAA AA AAAA AAA AA AAAAAAA AAAA AAA AA AAA AAAAAAAA。 Aa aa aaaaa aa Aa。 aaaa aaaaaaaaaa aaaaaaaa aaaaaaaaa aaaa aaaa。 AAA A / A A AAA AAAAA AAAAA AAA 9999 AAAAA AAAA AAAAAAAA AAAA AA AAAAAAAAAA,AAAA AA aaaaaaaaaaaaa AAA AAAAAAAAA,AAAAAAA,AAAAAAAAA,AAAAAAAAA AAAA,aaaaaaaaaaaaa。 AAAAAAAAA:AAAAA AAA AAAAAAAA AA AAAAAAA AA AAAA AAAAA,AAAA AAAAAAA AAA AAAAAAAA AAAAAAA AAAAAAAAA AAA AAAA AA AAAA AAAAAAAA AA AAAAA AAAAAAAAA AAAAAAA AA AAAA-AAAAAAAAAA AAAAAAAAAA,AAA AAAAAAAAA AAAAAAA AAAA。 "
Answer 0 (score: 1)
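Spark's CSV relation is based on its TextBasedFileFormat and can only look at the input line by line, so it does not support multi-line records. If you need to support multi-line records, you can use wholeTextFiles instead and parse the input manually (though ideally this should be done as a preprocessing data-cleaning job).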
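To illustrate what "parse the input manually" could look like, here is a minimal sketch of a quote-aware record splitter in plain Scala. The object name MultiLineCsv and its parse method are hypothetical names chosen for this example; they are not part of spark-csv or univocity. The sketch assumes fields are comma-delimited, quoted with double quotes, and that a literal quote inside a quoted field is written as two double quotes.

```scala
// Hypothetical helper (not a Spark or spark-csv API): splits raw CSV text
// into records, keeping newlines that appear inside quoted fields as data.
object MultiLineCsv {
  def parse(text: String, delimiter: Char = ',', quote: Char = '"'): Vector[Vector[String]] = {
    val records = Vector.newBuilder[Vector[String]]
    var fields = Vector.newBuilder[String]
    val field = new StringBuilder
    var inQuotes = false
    var recordHasContent = false

    // Close the current field and add it to the current record.
    def endField(): Unit = {
      fields += field.toString
      field.clear()
    }

    // Close the current record and start a fresh one.
    def endRecord(): Unit = {
      endField()
      records += fields.result()
      fields = Vector.newBuilder[String]
      recordHasContent = false
    }

    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (inQuotes) {
        if (c == quote) {
          if (i + 1 < text.length && text.charAt(i + 1) == quote) {
            field.append(quote) // doubled quote is an escaped literal quote
            i += 1
          } else {
            inQuotes = false    // closing quote
          }
        } else {
          field.append(c)       // includes '\n' inside a quoted field
        }
      } else if (c == quote) {
        inQuotes = true
        recordHasContent = true
      } else if (c == delimiter) {
        endField()
        recordHasContent = true
      } else if (c == '\n') {
        if (recordHasContent) endRecord() // blank lines are skipped
      } else if (c != '\r') {             // drop '\r' from CRLF line endings
        field.append(c)
        recordHasContent = true
      }
      i += 1
    }
    if (recordHasContent) endRecord() // last record without trailing newline
    records.result()
  }
}
```

With Spark, this could be applied as sc.wholeTextFiles("abc.csv").flatMap { case (_, content) => MultiLineCsv.parse(content) }. Note that wholeTextFiles loads each file as a single in-memory string, so a 3 GB file would probably have to be split into smaller files first, which is one more reason to treat this as a preprocessing step.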