spark-csv fails to parse csv with embedded html and quotes

Asked: 2018-08-21 18:43:05

Tags: scala csv apache-spark databricks spark-csv

I have a csv file containing descriptions of several cities:

Cities_information_extract.csv

I can parse the file with python's pandas.read_csv or with R's read.csv; both return 693 rows and 25 columns.

I tried, without success, to load the csv using Spark 1.6.0 and scala. For this I used spark-csv with commons-csv (already included on the spark jars path). This is what I tried:

var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/Cities_information_extract.csv")

cities_info.count()

// ERROR
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
    at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
    at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
    at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
    at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)

Then I tried the univocity parser:

var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("parserLib", "univocity").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/Cities_information_extract.csv")

cities_info.count()

// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]

Inspecting the file, I found that some description fields contain html tags with embedded quotes, for example

<div id="someID">
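An unescaped quote like this would explain the first failure: commons-csv treats the " before someID as the end of the quoted field, then finds a character that is neither the delimiter nor an end of record. A minimal sketch against commons-csv directly (with a made-up one-record input, not the actual file) reproduces the same exception:

import java.io.StringReader
import org.apache.commons.csv.{CSVFormat, CSVParser}

// A single record whose quoted field contains an unescaped embedded
// quote, mimicking the html attribute in the description column.
val record = "1;\"some text <div id=\"someID\"> more text\""

val parser = new CSVParser(new StringReader(record),
  CSVFormat.DEFAULT.withDelimiter(';'))
parser.getRecords()
// -> java.io.IOException: (line 1) invalid char between
//    encapsulated token and delimiter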

I tried to remove all the html tags with a python regex:

import io
import re

pattern = re.compile("<[^>]*>")    # find all html tags <..>
with io.open("Cities_information_extract.csv", "r", encoding="utf-8") as infile:
    text = infile.read()
    text = re.sub(pattern, " ", text)
    with io.open("cities_info_clean.csv", "w", encoding="utf-8") as outfile:
        outfile.write(text)

Next, I tried again with the new file, now free of html tags:

var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/cities_info_clean.csv")

cities_info.count()

// ERROR
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:304)
[...]

This time the lexer reaches EOF inside a quoted field, which suggests that stripping the tags still left at least one unbalanced quote behind. With the univocity parser:

var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("parserLib", "univocity").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/cities_info_clean.csv")

cities_info.count()

// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
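The ArrayIndexOutOfBoundsException: 100000 is consistent with an unbalanced quote: the parser stays in quoted-value mode and keeps appending everything that follows into a single column buffer until that buffer (apparently 100,000 chars, judging by the index) overflows. A small sketch against univocity directly, with the buffer shrunk so it fails fast, shows the mechanism; with the univocity 1.5.x that spark-csv bundles this surfaces as the raw ArrayIndexOutOfBoundsException seen above, while newer versions wrap it in a TextParsingException:

import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings
settings.getFormat.setDelimiter(';')
// Tiny column buffer so the overflow shows up immediately;
// spark-csv apparently runs with 100000.
settings.setMaxCharsPerColumn(100)

val parser = new CsvParser(settings)
// An opening quote that is never closed: the parser buffers the rest
// of the input as one value until the column buffer overflows.
parser.parseAll(new StringReader("1;\"unterminated " + ("x" * 200)))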

Both python and R are able to parse both files correctly, while spark-csv still fails. Any suggestions for parsing this csv file correctly with spark/scala?
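One direction I have not verified yet: since the embedded quotes are data rather than field delimiters, it might work to disable quote handling altogether so quotes are read as literal characters. With spark-csv this is often done by pointing the quote option at a character that never occurs in the data, such as \u0000; it assumes ';' never appears unquoted inside a field:

// A sketch, not verified against this file: treat quotes as ordinary
// characters by setting the quote option to an unused char (\u0000).
// Only safe if ';' never occurs inside a description field.
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  option("quote", "\u0000").
  load("path/to/Cities_information_extract.csv")

cities_info.count()   // expected: 693, as with pandas/R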
