I have a TSV file with many rows. Most rows load fine, but I ran into a problem with this one:
tt7841930 tvEpisode "Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded 0 2018 \N 24 Animation,Family
I use Spark with Scala to load the file into a DataFrame:
val titleBasicsDf = spark.read
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("delimiter", "\t")
.csv("title.basics.tsv.gz")
As a result, I get:
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tconst |titleType|primaryTitle |originalTitle|isAdult|startYear|endYear|runtimeMinutes |genres|averageRating|numVotes|parentTconst|seasonNumber|episodeNumber|
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tt7841930|tvEpisode|"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded|0 |2018 |\N |24 |Animation,Family|null |null |null |tt4947580 |6 |2 |
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
As you can see, the following data in that row:
"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded
is not split correctly into two separate values for the primaryTitle and originalTitle columns; instead, primaryTitle contains both values:
{
"runtimeMinutes":"Animation,Family",
"tconst":"tt7841930",
"seasonNumber":"6",
"titleType":"tvEpisode",
"averageRating":null,
"originalTitle":"0",
"parentTconst":"tt4947580",
"startYear":null,
"endYear":"24",
"numVotes":null,
"episodeNumber":"2",
"primaryTitle":"\"Stop and Hear the Cicadas/Cold-Blooded\t\"Stop and Hear the Cicadas/Cold-Blooded",
"isAdult":2018,
"genres":null
}
What am I doing wrong, and how can I configure Spark to parse and split this row correctly? As I mentioned, many other rows in this file are split into the proper columns without any problem.
Answer 0 (score: 1)
I found the answer here: https://github.com/databricks/spark-csv/issues/89
The way to turn off the default handling of the double-quote character (") is to add an .option() call with the right parameter to the read (or write) chain. The goal of that option() call is to change how the csv() method "finds" instances of quoted content. To do this, you change the default of what a "quote" character actually is; that is, you change the sought character from the double-quote character (") to the Unicode "\u0000" character (essentially the Unicode NUL character, on the assumption that it never occurs in the document).
The following magic option does the trick:
.option("quote", "\u0000")
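Putting it together, here is a minimal sketch of the corrected reader. The file name and column names follow the question; the SparkSession setup and the final filter are illustrative additions, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration.
val spark = SparkSession.builder()
  .appName("imdb-title-basics")
  .master("local[*]")
  .getOrCreate()

val titleBasicsDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")   // TSV: fields are tab-separated
  .option("quote", "\u0000")   // NUL never appears, so quoting is effectively disabled
  .csv("title.basics.tsv.gz")

// With quoting disabled, the stray double quotes in the problem row are
// kept as ordinary characters and the row splits into its proper columns.
titleBasicsDf
  .filter(col("tconst") === "tt7841930")
  .show(truncate = false)
```

Setting `quote` to `\u0000` means no character is ever treated as a quote, so a lone `"` inside a field can no longer swallow the tab that separates primaryTitle from originalTitle.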