I have a very large .tsv file with a somewhat strange structure: it contains lines like this:
CA 110123140 14228123056896 [{"id":"missing_required_gtin_future_disapproval","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:57:36Z"}, {"id":"missing_required_gtin_error","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:57:36Z"}]]
So, as you can see, there are 4 columns, but the fourth column is a JSON object (an array of objects, really).
I can load the file into a DataFrame on Spark with:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load(file_path)
But this:
df.take(1)(0)(3)
yields:
res53: Any = [{"id":"missing_required_gtin_future_disapproval","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:54:43Z"}, {"id":"missing_required_gtin_error","location":"gtin","severity":"critical","timestamp":"2017-02-19T20:54:43Z"}]
which makes it hard (for me) to parse into JSON objects.
Ideally, what I would like is a DataFrame whose columns are the keys of the JSON objects:
"id" "location" "severity" "timestamp"
123 blabla critical 2017-02-19T20:54:43Z
234 blabla critical 2017-02-19T21:54:43Z
So the difficulty is twofold.
EDIT:
I realize I wasn't very clear about what I actually want. What I would really like is to also have access to the first three columns, so that the final df looks like this:
"country" "user" "object" "id" "location" "severity" "timestamp"
CA 98172937 87647563 123 blabla critical 2017-02-19T20:54:43Z
CA 98172937 87647563 234 blabla critical 2017-02-19T21:54:43Z
This is what I think is the hardest part, since it involves somehow joining the information in the first 3 columns onto the rows produced from the JSON objects.
Answer 0 (score: 2)
You can read the data as an RDD and then convert the JSON column to a DataFrame like this:
val rdd = sc.textFile("path/filet.tsv").map(_.split('\t')(3))
val df = sqlContext.read.json(rdd)
df.printSchema
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- severity: string (nullable = true)
|-- timestamp: string (nullable = true)
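To also keep the first three columns (the edit in the question), one option is to wrap each line's leading fields and its JSON array into a single JSON record before handing it to `sqlContext.read.json`, then explode the array. This is a minimal sketch, not tested against your data; the column names `country`, `user`, `object` and the wrapper key `errors` are my own guesses, and it assumes the first three fields contain no characters that would need JSON escaping:

```scala
import sqlContext.implicits._
import org.apache.spark.sql.functions.explode

// Split each line on tabs, as in the answer above
val lineRdd = sc.textFile("path/filet.tsv").map(_.split('\t'))

// Wrap the first three columns and the JSON array into one JSON record per line
val jsonRdd = lineRdd.map { cols =>
  s"""{"country":"${cols(0)}","user":"${cols(1)}","object":"${cols(2)}","errors":${cols(3)}}"""
}

// Let Spark infer the schema, then turn each array element into its own row
val full = sqlContext.read.json(jsonRdd)
  .withColumn("error", explode($"errors"))
  .select("country", "user", "object",
          "error.id", "error.location", "error.severity", "error.timestamp")
```

Because `explode` produces one output row per array element, the `country`/`user`/`object` values are automatically repeated on every resulting row, which matches the desired table in the question.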