Question

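I want to perform a "join" on two Spark DataFrames (Scala), but rather than a SQL-like join, I want to insert the "joined" rows from the second DataFrame as a single nested column in the first. The reason for doing this is ultimately to write the result back out as JSON with a nested structure. I know the answer is probably already on Stack Overflow, but some searching did not turn it up.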

Table 1

Table 2

root
 |-- Insdc: string (nullable = true)
 |-- LastMetaUpdate: string (nullable = true)
 |-- LastUpdate: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Received: string (nullable = true)
 |-- ReplacedBy: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- accession: string (nullable = true)
 |-- alias: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- tag: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- center_name: string (nullable = true)
 |-- design_description: string (nullable = true)
 |-- geo_accession: string (nullable = true)
 |-- instrument_model: string (nullable = true)
 |-- library_construction_protocol: string (nullable = true)
 |-- library_name: string (nullable = true)
 |-- library_selection: string (nullable = true)
 |-- library_source: string (nullable = true)
 |-- library_strategy: string (nullable = true)
 |-- paired: boolean (nullable = true)
 |-- platform: string (nullable = true)
 |-- read_spec: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- base_coord: long (nullable = true)
 |    |    |-- read_class: string (nullable = true)
 |    |    |-- read_index: long (nullable = true)
 |    |    |-- read_type: string (nullable = true)
 |-- sample_accession: string (nullable = true)
 |-- spot_length: long (nullable = true)
 |-- study_accession: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)

study

Answer 1

As I understand your question, suppose you have two DataFrames

df1 
root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = false)
 |-- col3: double (nullable = false)

and

df2
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = false)

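You need to combine all the columns of df2 into a single struct column, then select the join column together with that struct column. Here I use col1 as the join column: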

import org.apache.spark.sql.functions._
// col("col1") avoids needing `import spark.implicits._` for the $"..." syntax
val nestedDF2 = df2.select(col("col1"), struct(df2.columns.map(col): _*).as("nested_df2"))

The final step is the join (an inner join by default here):

df1.join(nestedDF2, Seq("col1"))

which should give you

root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = false)
 |-- col3: double (nullable = false)
 |-- nested_df2: struct (nullable = false)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)
 |    |-- col3: double (nullable = false)
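The steps above can be put together as a small runnable sketch. The sample rows, the object name NestedJoinSketch, the helper joinNested and the local SparkSession settings are all assumptions made for illustration and are not part of the original answer. It also uses a left join rather than the default inner join, on the assumption that rows of df1 without a match should still appear in the JSON output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical wrapper object; the names here are illustrative only.
object NestedJoinSketch {
  // Pack every column of `right` into one struct column, keep `key` outside
  // the struct as the join column, then left-join the result onto `left`.
  def joinNested(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    val nested = right.select(col(key), struct(right.columns.map(col): _*).as("nested_df2"))
    // "left" keeps rows of `left` with no match; nested_df2 is null for those.
    left.join(nested, Seq(key), "left")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder().master("local[*]").appName("nested-join-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample rows shaped like the df1 and df2 schemas above.
    val df1 = Seq(("a", 1, 1.0), ("b", 2, 2.0)).toDF("col1", "col2", "col3")
    val df2 = Seq(("a", "x", 10.0)).toDF("col1", "col2", "col3")

    val joined = joinNested(df1, df2, "col1")
    joined.printSchema()
    // One JSON document per row; the struct column becomes a nested object.
    joined.toJSON.collect().foreach(println)
    // To write files instead, joined.write.json(<output path>) could be used.
    spark.stop()
  }
}
```

Switching the join type back to "inner" reproduces the behaviour of the answer's df1.join(nestedDF2, Seq("col1")).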

I hope the answer is helpful.
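If several rows of the second DataFrame can share the same key, a struct column alone would duplicate the rows of the first DataFrame. One possible variation, sketched here as an assumption rather than as part of the answer, is to group df2 by the key and gather the structs into an array with collect_list before joining. The object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical wrapper object; the names here are illustrative only.
object NestedArrayJoinSketch {
  // Collapse `right` to one row per key, holding all of that key's rows
  // as an array of structs, then left-join the array onto `left`.
  def joinNestedArray(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    val nested = right
      .groupBy(col(key))
      .agg(collect_list(struct(right.columns.map(col): _*)).as("nested_df2"))
    left.join(nested, Seq(key), "left")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder().master("local[*]").appName("nested-array-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample rows: key "a" matches two rows of df2, key "b" matches none.
    val df1 = Seq(("a", 1, 1.0), ("b", 2, 2.0)).toDF("col1", "col2", "col3")
    val df2 = Seq(("a", "x", 10.0), ("a", "y", 20.0)).toDF("col1", "col2", "col3")

    val joined = joinNestedArray(df1, df2, "col1")
    joined.printSchema()
    // The array column becomes a JSON array of nested objects.
    joined.toJSON.collect().foreach(println)
    spark.stop()
  }
}
```

The result keeps one row per row of df1, with nested_df2 typed as an array of structs instead of a single struct.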

How to create a nested column via a join in Spark?
