How to create a nested column by joining in Spark?

Date: 2018-01-11 23:57:24

Tags: scala apache-spark apache-spark-sql

I want to perform a "join" on two Spark DataFrames (Scala), but rather than a SQL-like join, I want to insert the "joined" rows from the second DataFrame as a single nested column in the first. The reason for doing this is ultimately to write the result back out as JSON with a nested structure. I know the answer is most likely already on Stack Overflow, but some searching has not turned it up.

Table 1

root
 |-- BioProject: string (nullable = true)
 |-- Insdc: string (nullable = true)
 |-- LastMetaUpdate: string (nullable = true)
 |-- LastUpdate: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Received: string (nullable = true)
 |-- ReplacedBy: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- abstract: string (nullable = true)
 |-- accession: string (nullable = true)
 |-- alias: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- tag: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- dbGaP: string (nullable = true)
 |-- description: string (nullable = true)
 |-- external_id: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- namespace: string (nullable = true)
 |-- submitter_id: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- namespace: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)

Table 2

root
 |-- Insdc: string (nullable = true)
 |-- LastMetaUpdate: string (nullable = true)
 |-- LastUpdate: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Received: string (nullable = true)
 |-- ReplacedBy: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- accession: string (nullable = true)
 |-- alias: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- tag: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- center_name: string (nullable = true)
 |-- design_description: string (nullable = true)
 |-- geo_accession: string (nullable = true)
 |-- instrument_model: string (nullable = true)
 |-- library_construction_protocol: string (nullable = true)
 |-- library_name: string (nullable = true)
 |-- library_selection: string (nullable = true)
 |-- library_source: string (nullable = true)
 |-- library_strategy: string (nullable = true)
 |-- paired: boolean (nullable = true)
 |-- platform: string (nullable = true)
 |-- read_spec: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- base_coord: long (nullable = true)
 |    |    |-- read_class: string (nullable = true)
 |    |    |-- read_index: long (nullable = true)
 |    |    |-- read_type: string (nullable = true)
 |-- sample_accession: string (nullable = true)
 |-- spot_length: long (nullable = true)
 |-- study_accession: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)

The tables are joined on table1.study_accession. In the result, note the new column named table2.accession, which contains the records of the rows from Table 2 with matching values.


1 Answer:

Answer 0 (score: 4)

As I understand your question, suppose you have the following two dataframes:

df1 
root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = false)
 |-- col3: double (nullable = false)

df2
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = false)

You have to combine all of df2's columns into a struct column, then select the join column and the struct column. Here I use col1 as the join column:

import org.apache.spark.sql.functions._
// assumes spark.implicits._ is in scope (as in spark-shell) for the $ syntax
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col): _*).as("nested_df2"))
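As a side note on the struct line: df2.columns returns the column names as an Array[String], map(col) turns each name into a Column, and : _* expands that sequence into the varargs that struct expects, so the struct column automatically covers every column of df2.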

Then the final step is the join (an inner join by default):

df1.join(nestedDF2, Seq("col1"))

which should give you:

root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = false)
 |-- col3: double (nullable = false)
 |-- nested_df2: struct (nullable = false)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)
 |    |-- col3: double (nullable = false)
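Since the stated goal was to write the nested structure back out as JSON, here is a minimal sketch of that final step, together with a variation for the case where several df2 rows match one key; the output paths and the arrayDF2/df2_rows names are illustrative, not from the original answer.

// One JSON object per row; nested_df2 serializes as a nested object.
df1.join(nestedDF2, Seq("col1")).write.json("out/nested")

// If several rows of df2 can match the same key and should appear as a
// JSON array, one option is to aggregate the structs before joining:
val arrayDF2 = df2
  .groupBy($"col1")
  .agg(collect_list(struct(df2.columns.map(col): _*)).as("df2_rows"))

df1.join(arrayDF2, Seq("col1")).write.json("out/nested_array")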

I hope the answer is helpful.