Question

I am working with Spark SQL on Spark 2.2 and using the Java API to load data from a CSV file.

In the CSV file, some cells contain quotes, and the column separator is a pipe (|).

Example line:

2012|"Hello|World"

This is my code to read the CSV and return a Dataset:

SparkSession session = SparkSession.builder().getOrCreate();
Dataset<Row> dataset = session.read().option("header", "true").option("delimiter", "|").csv(filePath);

This is what I got:

+-----+--------------+--------------------------+
|Year |       c1     |               c2         |
+-----+--------------+--------------------------+
|2012 |Hello|World   |              null        |
+-----+--------------+--------------------------+

The expected result is for the row to be split at every pipe, so that c1 and c2 each get a value and the quote characters stay part of the data.

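The only thing I can think of is to remove the quotes ("), but that is out of the question because I do not want to change the cell values.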

Any ideas would be appreciated.

Answer 1

Try this:

Dataset<Row> test = spark.read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    // use a space as the quote character so that " is read as ordinary data
    .option("quote", " ")
    .load(filePath);
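To see why changing the quote character helps, here is a minimal sketch in plain Java (no Spark involved). The class QuoteSplitDemo and its split method are hypothetical names made up for this illustration, and the logic is a simplified stand-in for what a CSV parser does, not Spark's actual implementation. It splits the example line once with " as the quote character and once with a space, as in the option above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark.
class QuoteSplitDemo {

    // Splits a line on the delimiter. Delimiters that appear between a pair
    // of quote characters are kept as data, and the quote characters
    // themselves are dropped from the output.
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == quote) {
                inQuotes = !inQuotes;            // enter or leave a quoted section
            } else if (ch == delimiter && !inQuotes) {
                fields.add(current.toString());  // end of the current field
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString());          // last field on the line
        return fields;
    }

    public static void main(String[] args) {
        String line = "2012|\"Hello|World\"";
        // Quote character is ": the pipe inside the quotes is not a separator.
        System.out.println(split(line, '|', '"'));  // [2012, Hello|World]
        // Quote character is a space: " is plain data, so every pipe splits.
        System.out.println(split(line, '|', ' '));  // [2012, "Hello, World"]
    }
}
```

With " as the quote character the line yields only two fields, which matches the null in c2 from the question. With a space as the quote character the line yields three fields, and the quotes remain in the cell values.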
