我有一个包含几个城市描述的csv文件:
Cities_information_extract.csv
我可以使用python pandas.read_csv或R read.csv方法解析该文件。它们都返回693行25列。
我尝试使用Spark 1.6.0和scala加载csv,但未成功。 为此,我使用了spark-csv和commons-csv(已包含在spark jars路径中)。 这就是我尝试过的:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
然后我尝试使用univocity解析器:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("parserLib", "univocity").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
检查文件时,我发现描述字段中存在一些带有嵌入引号的html标记,例如
<div id="someID">
我尝试使用python使用正则表达式删除所有html标签:
import os
import re
pattern = re.compile("<[^>]*>") # find all html tags <..>
with io.open("Cities_information_extract.csv", "r", encoding="utf-8") as infile:
text = infile.read()
text = re.sub(pattern, " ", text)
with io.open("cities_info_clean.csv", "w", encoding="utf-8") as outfile:
outfile.write(text)
接下来,我再次尝试了不带html标记的新文件:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:304)
[...]
使用单义分析器:
var cities_info = sqlContext.read.
format("com.databricks.spark.csv").
option("parserLib", "univocity").
option("header", "true").
option("inferSchema", "false").
option("delimiter", ";").
load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
尽管spark-csv仍然失败,但是python和R都能够正确解析两个文件。有任何建议使用spark-scala正确解析此csv文件吗?