Spark TSV file and incorrect column splitting

Time: 2018-11-25 09:37:50

Tags: scala apache-spark apache-spark-sql

I have a TSV file with many rows. Most of the rows parse fine, but I have a problem with the following row:

tt7841930   tvEpisode   "Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded 0   2018    \N  24  Animation,Family

I am using Spark and Scala to load the file into a DataFrame:

val titleBasicsDf = spark.read
  .format("org.apache.spark.csv")
  .option("header", true)
  .option("inferSchema", true)
  .option("delimiter", "    ")
  .csv("title.basics.tsv.gz")

As a result I get:

+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tconst   |titleType|primaryTitle                                                                   |originalTitle|isAdult|startYear|endYear|runtimeMinutes  |genres|averageRating|numVotes|parentTconst|seasonNumber|episodeNumber|
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tt7841930|tvEpisode|"Stop and Hear the Cicadas/Cold-Blooded    "Stop and Hear the Cicadas/Cold-Blooded|0            |2018   |\N       |24     |Animation,Family|null  |null         |null    |tt4947580   |6           |2            |
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+

As you can see, the following data in the row:

"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded

was not split correctly into two separate values for the primaryTitle and originalTitle columns; instead, primaryTitle ends up holding both values:

{
   "runtimeMinutes":"Animation,Family",
   "tconst":"tt7841930",
   "seasonNumber":"6",
   "titleType":"tvEpisode",
   "averageRating":null,
   "originalTitle":"0",
   "parentTconst":"tt4947580",
   "startYear":null,
   "endYear":"24",
   "numVotes":null,
   "episodeNumber":"2",
   "primaryTitle":"\"Stop and Hear the Cicadas/Cold-Blooded\t\"Stop and Hear the Cicadas/Cold-Blooded",
   "isAdult":2018,
   "genres":null
}

What am I doing wrong, and how do I configure Spark to parse and split this row correctly? As I mentioned earlier, many other rows in this file are split into the proper columns just fine.

1 Answer:

Answer 0 (score: 1)

I found the answer here: https://github.com/databricks/spark-csv/issues/89


  The way to turn off the default escaping of the double quote character (") with the backslash character (\), i.e. to avoid escaping for all characters entirely, is to add an .option() method call with just the right parameters after the .write() method call. The goal of the option() method call is to change the way the csv() method "finds" instances of the "quote" character as it emits the content. To do this, you must change the default of what a "quote" actually means, i.e. change the character sought from the double quote character (") to the Unicode "\u0000" character (essentially providing the Unicode NUL character, on the assumption that it will never appear in the document).

The following magic option does the trick:

.option("quote", "\u0000")