我想表演一个"加入"在两个Spark DataFrames(Scala)上,我不想使用类似SQL的连接,而是要插入"加入"第二个DataFrame中的行作为第一个中的单个嵌套列。这样做的原因最终是使用嵌套结构写回JSON。我知道答案很可能已经在Stackoverflow上了,但有些搜索没有找到答案。
表1
0
表2
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
加入root
|-- BioProject: string (nullable = true)
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- abstract: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- dbGaP: string (nullable = true)
|-- description: string (nullable = true)
|-- external_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- submitter_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
table1.study_accession
。结果如下。请注意名为table2.accession
的新列,其中包含表2中行的记录等值。
study
答案 0 :(得分:4)
从我的理解到你的问题,假设你有两个数据帧
df1
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
和
df2
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
您必须将df2
的所有列合并到struct
列中,然后选择要加入的列和struct
列。在这里,我将col1
作为加入列
import org.apache.spark.sql.functions._
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col):_*).as("nested_df2"))
然后最后一步是join
(此处默认为inner join
)
df1.join(nestedDF2, Seq("col1"))
应该给你
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
|-- nested_df2: struct (nullable = false)
| |-- col1: string (nullable = true)
| |-- col2: string (nullable = true)
| |-- col3: double (nullable = false)
我希望答案很有帮助