org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> A.unionAll(B)


org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;

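I would like to do this from within Spark. However, the Spark docs only suggest writing both DataFrames out to a directory and then reading them back in using option("mergeSchema", "true").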

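So a union does not help me, and neither does the documentation. I would like to keep this extra I/O out of my job if at all possible. Am I missing some undocumented information, or is it not possible (yet)?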

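import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// A is the narrower frame here, so collect the fields that B has and A lacks.
val missingFields = B.schema.fields.filterNot(f => A.columns.contains(f.name))
var C: DataFrame = A
for (field <- missingFields) {
  // Add each missing field to the running result as a typed null column.
  C = C.withColumn(field.name, lit(null).cast(field.dataType))
}
// C now has B's columns; select them in B's order before a positional union.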

(1) set global option: spark.sql.parquet.mergeSchema=true

(2) write code: spark.read.option("mergeSchema", "true").parquet("my.parquet") (where spark is the SparkSession; on Spark 1.x use sqlContext.read instead)

from pyspark.sql.functions import lit

def harmonize_schemas_and_combine(df_left, df_right):
    # Map each column name to its data type, per side
    left_types = {f.name: f.dataType for f in df_left.schema}
    right_types = {f.name: f.dataType for f in df_right.schema}
    left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
    right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)

    # First go over left-unique fields
    for l_name, l_type, l_nullable in left_fields.difference(right_fields):
        if l_name in right_types:
            r_type = right_types[l_name]
            if l_type != r_type:
                raise TypeError("Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not l_nullable))
        df_right = df_right.withColumn(l_name, lit(None).cast(l_type))

    # Now go over right-unique fields
    for r_name, r_type, r_nullable in right_fields.difference(left_fields):
        if r_name in left_types:
            l_type = left_types[r_name]
            if r_type != l_type:
                raise TypeError("Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not r_nullable))
        df_left = df_left.withColumn(r_name, lit(None).cast(r_type))

    # union() matches columns by position, so align the column order first
    return df_left.union(df_right.select(df_left.columns))

import pyspark.sql.functions as F

def harmonize_schemas_and_combine(df_left, df_right):
    '''
    df_left is the main df; we try to append the new df_right to it.
    Need to do three things here:
    1. Set other claim/clinical features to NULL
    2. Align schemas (data types)
    3. Align column orders
    '''
    left_types = {f.name: f.dataType for f in df_left.schema}
    right_types = {f.name: f.dataType for f in df_right.schema}
    left_fields = set((f.name, f.dataType) for f in df_left.schema)
    right_fields = set((f.name, f.dataType) for f in df_right.schema)

    # I. First go over left-unique fields:
    # For columns in the main df, but not in the new df: add it as Null
    # For columns in both df but w/ different datatypes, use casting to keep them consistent w/ main df (Left)
    for l_name, l_type in left_fields.difference(right_fields):  # 1. find what Left has, Right doesn't
        if l_name in right_types:  # 2A. if column is in both, then something's off w/ the schema
            r_type = right_types[l_name]  # 3. tell me what's this column's type in Right
            df_right = df_right.withColumn(l_name, df_right[l_name].cast(l_type))  # 4. keep them consistent w/ main df (Left)
            print("Casting magic happened on column %s: Left type: %s, Right type: %s. Both are now: %s." % (l_name, l_type, r_type, l_type))
        else:  # 2B. if Left column is not in Right, add a NULL column to Right df
            df_right = df_right.withColumn(l_name, F.lit(None).cast(l_type))

    # Make sure Right columns are in the same order of Left
    # (this also drops any column that exists only in Right)
    df_right = df_right.select(df_left.columns)

    return df_left.union(df_right)

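Here is another solution. I use an RDD union because the DataFrame union operation does not accept more than two DataFrames at once.

Note: this should not be used to merge many DataFrames with different schemas. The cost of adding null columns to the DataFrames will quickly lead to out-of-memory errors (for example, merging 1000 DataFrames that are each missing 10 columns results in 10,000 transformations). If your use case is reading a DataFrame from storage that is made up of multiple paths with different schemas, a better option is to save the data as Parquet in the first place and then use the "mergeSchema" option when reading the DataFrame.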

# Logging via the standard logging module (assumed; the original logger name is not shown).
import logging
import pyspark.sql.functions as func

def unionDataFramesAndMergeSchema(spark, dfsList):
    '''
    This function can perform a union between x dataFrames with different schemas.
    Non-existing columns will be filled with null.
    Note: If a column exists in 2 dataFrames with different types, an exception will be thrown.
    >>> df1 = spark.createDataFrame([
    >>>    {
    >>>        'A': 1,
    >>>        'B': 1,
    >>>        'C': 1
    >>>    }])
    >>> df2 = spark.createDataFrame([
    >>>    {
    >>>        'A': 2,
    >>>        'C': 2,
    >>>        'DNew' : 2
    >>>    }])
    >>> unionDataFramesAndMergeSchema(spark,[df1,df2]).show()
    >>> +---+----+---+----+
    >>> |  A|   B|  C|DNew|
    >>> +---+----+---+----+
    >>> |  2|null|  2|   2|
    >>> |  1|   1|  1|null|
    >>> +---+----+---+----+
    :param spark: The Spark session.
    :param dfsList: A list of dataFrames.
    :return: A union of all dataFrames, with schema merged.
    '''
    if len(dfsList) == 0:
        raise ValueError("DataFrame list is empty.")
    if len(dfsList) == 1:
        logging.info("The list contains only one dataFrame, no need to perform union.")
        return dfsList[0]

    logging.info("Will perform union between {0} dataFrames...".format(len(dfsList)))

    columnNamesAndTypes = {}
    logging.info("Calculating unified column names and types...")
    for df in dfsList:
        for columnName, columnType in df.dtypes:
            if columnName in columnNamesAndTypes and columnNamesAndTypes[columnName] != columnType:
                raise ValueError(
                    "column '{0}' exists in at least 2 dataFrames with different types ('{1}' and '{2}')"
                    .format(columnName, columnType, columnNamesAndTypes[columnName]))
            columnNamesAndTypes[columnName] = columnType
    logging.info("Unified column names and types: {0}".format(columnNamesAndTypes))

    logging.info("Adding null columns in dataFrames if needed...")
    newDfsList = []
    for df in dfsList:
        newDf = df
        dfTypes = dict(df.dtypes)
        for columnName, columnType in columnNamesAndTypes.items():
            if columnName not in dfTypes:
                # logging.info("Adding null column for '{0}'.".format(columnName))
                newDf = newDf.withColumn(columnName, func.lit(None).cast(columnType))
        newDfsList.append(newDf)

    # Select the columns in one shared order so the rows line up positionally.
    orderedColumns = list(columnNamesAndTypes)
    dfsWithOrderedColumnsList = [df.select(orderedColumns) for df in newDfsList]
    logging.info("Performing a flat union between all dataFrames (as rdds)...")
    allRdds = spark.sparkContext.union([df.rdd for df in dfsWithOrderedColumnsList])
    # toDF() without a schema re-infers the column types from the rows.
    return allRdds.toDF()
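
The schema-unification step in the function above can be tried on its own, without a Spark session. What follows is only a minimal sketch, assuming each DataFrame is represented by its `df.dtypes` list of (name, type) pairs. The helper names `merge_dtypes` and `missing_columns` are hypothetical and not part of PySpark.

```python
def merge_dtypes(dtypes_list):
    """Merge several df.dtypes-style lists into one name -> type mapping.

    Column order follows first appearance. A column that shows up with two
    different types raises ValueError, mirroring the check in the answer above.
    """
    merged = {}
    for dtypes in dtypes_list:
        for name, col_type in dtypes:
            if name in merged and merged[name] != col_type:
                raise ValueError(
                    "column '{0}' has conflicting types ('{1}' and '{2}')"
                    .format(name, merged[name], col_type))
            merged[name] = col_type
    return merged


def missing_columns(dtypes, merged):
    """Return the (name, type) pairs from merged that this frame lacks."""
    present = {name for name, _ in dtypes}
    return [(name, col_type) for name, col_type in merged.items()
            if name not in present]
```

Each pair returned by `missing_columns` corresponds to one `withColumn(name, lit(None).cast(col_type))` call on the real DataFrame, which makes it easy to count in advance how many transformations a large merge would add.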

import org.apache.spark.sql.types._

val schemaForRead = StructType(List(
  StructField("userId", LongType, true),
  StructField("dtEvent", LongType, true),
  StructField("goodsId", LongType, true)
))
val dfA = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file1.parquet")
val dfB = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file2.parquet")
val dfC = dfA.union(dfB)

Please note that the schemas in files file1 and file2 can differ from each other and from schemaForRead. If file1 does not contain a field from schemaForRead, DataFrame A will have that field filled with nulls. If a file contains additional fields that are not present in schemaForRead, DataFrame A will not include them.

A Scala version is also answered here: (Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema)


import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {

    /**
     * This function accepts DataFrames with the same or different schemas/column order, with some or no common columns.
     * Creates a unioned DataFrame.
     */

    import spark.implicits._

    val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct

    def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
      allCols.toList.map(x => x match {
        case x if myCols.contains(x) => col(x)
        case _                       => lit(null).as(x)
      })
    }

    // Create EmptyDF, ignoring different datatypes in StructField and treating them the same based on name, ignoring case

    val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)

    val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)

    DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))

}



    val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
    val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
    unionPro(List(aDF, bDF), spark).show


+----+----+----+-----+
|Name|  ID| Sal|Deptt|
+----+----+----+-----+
|   A|   1|null| null|
|   B|   2|null| null|
|   C|null|   1|   D1|
|   D|null|   2|   D2|
+----+----+----+-----+

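If you are using Spark 2.3.0 or later, you can use the built-in unionByName function to get the desired output. Note that unionByName only fills in missing columns with nulls when called with allowMissingColumns set to true, which requires Spark 3.1 or later; on earlier versions both DataFrames must already have the same set of columns.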
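
For more than two DataFrames, the same call can be folded over a list. The following is a minimal PySpark-style sketch, assuming Spark 3.1 or later so that `unionByName` accepts `allowMissingColumns`. The helper name `union_all_by_name` is hypothetical and not part of the Spark API.

```python
from functools import reduce


def union_all_by_name(dfs):
    """Union DataFrames by column name, filling missing columns with nulls.

    Assumes every element offers DataFrame.unionByName with the
    allowMissingColumns parameter (PySpark 3.1 or later).
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("DataFrame list is empty.")
    # Fold left to right: ((df0 union df1) union df2) ...
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        dfs,
    )
```

Called as `union_all_by_name([df1, df2])`, this avoids adding the null columns by hand. Columns that share a name but have incompatible types are not reconciled by this sketch and would still need an explicit cast beforehand.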

Link to the Git repository containing the unionByName code: