Spark: union DataFrames with mismatched schemas without extra disk I/O

Asked: 2016-10-05 08:35:47

Tags: scala apache-spark

I want to union two DataFrames with (potentially) mismatched schemas:

org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> A.unionAll(B)

fails with:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;

I would like to do this from within Spark. However, the Spark docs only suggest writing both DataFrames out to a directory and reading them back with spark.read.option("mergeSchema", "true").

link to docs

So the union doesn't help me, and neither do the docs. If possible, I'd like to keep this extra I/O out of my job. Am I missing some undocumented information, or is it not possible (yet)?

8 Answers:

Answer 0 (score: 7):

You can append null columns to frame B and then union the two frames:

import org.apache.spark.sql.functions._

val missingFields = A.schema.toSet.diff(B.schema.toSet)
var C: DataFrame = B
for (field <- missingFields) {
  C = C.withColumn(field.name, lit(null).cast(field.dataType))
}
// unionAll matches columns by position, so align the column order with A first
A.unionAll(C.select(A.columns.map(col): _*))
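The set-difference step above can be traced in plain Python without a Spark session; representing schema fields as `(name, type)` pairs is an illustrative simplification of Spark's `StructField`:

```python
# Sketch: which fields of A's schema are missing from B and must be added
# to B as null columns before a positional union. Schemas are modeled as
# sets of (name, type) pairs -- a simplification of Spark's StructField.
def missing_fields(schema_a, schema_b):
    return set(schema_a) - set(schema_b)

schema_a = {("name", "string"), ("age", "int"), ("height", "int")}
schema_b = {("name", "string"), ("age", "int")}

print(missing_fields(schema_a, schema_b))  # {('height', 'int')}
```

Only the fields unique to A are appended; fields present in both frames are left untouched.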

Answer 1 (score: 4):

Parquet schema merging is disabled by default. Enable it in one of the following ways:

(1) set global option: spark.sql.parquet.mergeSchema=true

(2) write code: sqlContext.read.option("mergeSchema", "true").parquet("my.parquet")

Answer 2 (score: 2):

Here's a pyspark solution.

It assumes that if the merge can't take place because one dataframe is missing a column contained in the other, then the right thing is to add the missing column with null values.

On the other hand, if the merge can't take place because the two dataframes share a column with conflicting type or nullability, then the right thing is to raise a TypeError (because that's a conflict you probably want to know about).

from pyspark.sql.functions import lit

def harmonize_schemas_and_combine(df_left, df_right):
    left_types = {f.name: f.dataType for f in df_left.schema}
    right_types = {f.name: f.dataType for f in df_right.schema}
    left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
    right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)

    # First go over left-unique fields
    for l_name, l_type, l_nullable in left_fields.difference(right_fields):
        if l_name in right_types:
            r_type = right_types[l_name]
            if l_type != r_type:
                raise TypeError("Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not l_nullable))
        df_right = df_right.withColumn(l_name, lit(None).cast(l_type))

    # Now go over right-unique fields
    for r_name, r_type, r_nullable in right_fields.difference(left_fields):
        if r_name in left_types:
            l_type = left_types[r_name]
            if r_type != l_type:
                raise TypeError("Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not r_nullable))
        df_left = df_left.withColumn(r_name, lit(None).cast(r_type))
    return df_left.union(df_right)
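The branching logic in that loop (type conflict vs. nullability conflict vs. add-a-null-column) can be traced with a small pure-Python model; the field triples mimic `(name, dataType, nullable)` and need no Spark session:

```python
# Classify each left-unique field the way the answer above does:
# same name but different type -> type conflict; same name and type but
# different nullability -> nullability conflict; otherwise the column is
# simply absent on the right and gets added there as a null column.
def classify_left_unique(left_fields, right_fields):
    right_types = {name: typ for name, typ, _ in right_fields}
    actions = {}
    for name, typ, nullable in left_fields - right_fields:
        if name in right_types:
            if typ != right_types[name]:
                actions[name] = "type_conflict"
            else:
                actions[name] = "nullability_conflict"
        else:
            actions[name] = "add_null_to_right"
    return actions

left = {("name", "string", True), ("age", "int", True), ("height", "int", True)}
right = {("name", "string", True), ("age", "bigint", True)}
print(sorted(classify_left_unique(left, right).items()))
# [('age', 'type_conflict'), ('height', 'add_null_to_right')]
```

In the real function the first two actions raise a TypeError; only the third silently repairs the schema.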

Answer 3 (score: 1):

Thanks @conradlee! I modified your solution to allow the union by adding casting and removing the nullability check. It worked for me.

from pyspark.sql import functions as F

def harmonize_schemas_and_combine(df_left, df_right):
    '''
    df_left is the main df; we try to append the new df_right to it.
    Need to do three things here:
    1. Set other claim/clinical features to NULL
    2. Align schemas (data types)
    3. Align column orders
    '''
    left_types = {f.name: f.dataType for f in df_left.schema}
    right_types = {f.name: f.dataType for f in df_right.schema}
    left_fields = set((f.name, f.dataType) for f in df_left.schema)
    right_fields = set((f.name, f.dataType) for f in df_right.schema)

    # I. First go over left-unique fields:
    # For columns in the main df, but not in the new df: add them as null
    # For columns in both dfs but w/ different datatypes, cast to keep them consistent w/ the main df (left)
    for l_name, l_type in left_fields.difference(right_fields):  # 1. find what left has and right doesn't
        if l_name in right_types:  # 2A. if the column is in both, then something's off w/ the schema
            r_type = right_types[l_name]  # 3. this column's type in right
            df_right = df_right.withColumn(l_name, df_right[l_name].cast(l_type))  # 4. keep consistent w/ the main df (left)
            print("Casting magic happened on column %s: Left type: %s, Right type: %s. Both are now: %s." % (l_name, l_type, r_type, l_type))
        else:  # 2B. if a left column is not in right, add a null column to the right df
            df_right = df_right.withColumn(l_name, F.lit(None).cast(l_type))

    # Make sure right columns are in the same order as left
    df_right = df_right.select(df_left.columns)

    return df_left.union(df_right)

Answer 4 (score: 0):

Here is another solution for this. I use an RDD union because the DataFrame union operation doesn't support multiple DataFrames.

Note - this should not be used to merge a large number of DataFrames with different schemas. The cost of adding null columns to the DataFrames will quickly cause out-of-memory errors (e.g., trying to merge 1,000 DataFrames, each missing 10 columns, will result in 10,000 transformations).

If your use case is reading a DataFrame from storage composed of multiple paths with different schemas, a much better option would be to save your data as parquet first and then use the "mergeSchema" option when reading the DataFrame back.

import logging

from pyspark.sql import functions as func

def unionDataFramesAndMergeSchema(spark, dfsList):
    '''
    This function can perform a union between x dataFrames with different schemas.
    Non-existing columns will be filled with null.
    Note: If a column exists in 2 dataFrames with different types, an exception will be thrown.
    :example:
    >>> df1 = spark.createDataFrame([
    >>>    {
    >>>        'A': 1,
    >>>        'B': 1,
    >>>        'C': 1
    >>>    }])
    >>> df2 = spark.createDataFrame([
    >>>    {
    >>>        'A': 2,
    >>>        'C': 2,
    >>>        'DNew' : 2
    >>>    }])
    >>> unionDataFramesAndMergeSchema(spark, [df1, df2]).show()
    >>> +---+----+---+----+
    >>> |  A|   B|  C|DNew|
    >>> +---+----+---+----+
    >>> |  2|null|  2|   2|
    >>> |  1|   1|  1|null|
    >>> +---+----+---+----+
    :param spark: The Spark session.
    :param dfsList: A list of dataFrames.
    :return: A union of all dataFrames, with schema merged.
    '''
    if len(dfsList) == 0:
        raise ValueError("DataFrame list is empty.")
    if len(dfsList) == 1:
        logging.info("The list contains only one dataFrame, no need to perform union.")
        return dfsList[0]

    logging.info("Will perform union between {0} dataFrames...".format(len(dfsList)))

    columnNamesAndTypes = {}
    logging.info("Calculating unified column names and types...")
    for df in dfsList:
        for columnName, columnType in dict(df.dtypes).items():
            if columnName in columnNamesAndTypes and columnNamesAndTypes[columnName] != columnType:
                raise ValueError(
                    "column '{0}' exists in at least 2 dataFrames with different types ('{1}' and '{2}')"
                        .format(columnName, columnType, columnNamesAndTypes[columnName]))
            columnNamesAndTypes[columnName] = columnType
    logging.info("Unified column names and types: {0}".format(columnNamesAndTypes))

    logging.info("Adding null columns in dataFrames if needed...")
    newDfsList = []
    for df in dfsList:
        newDf = df
        dfTypes = dict(df.dtypes)
        for columnName, columnType in columnNamesAndTypes.items():
            if columnName not in dfTypes:
                newDf = newDf.withColumn(columnName, func.lit(None).cast(columnType))
        newDfsList.append(newDf)

    dfsWithOrderedColumnsList = [df.select(list(columnNamesAndTypes.keys())) for df in newDfsList]
    logging.info("Performing a flat union between all dataFrames (as rdds)...")
    allRdds = spark.sparkContext.union([df.rdd for df in dfsWithOrderedColumnsList])
    return allRdds.toDF()
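The schema-unification loop in the answer above (merge the per-frame column/type maps, fail on a type conflict) can be checked without Spark; plain dicts stand in for the output of `df.dtypes`:

```python
# Sketch of the schema-unification step: merge name -> type mappings from
# several frames into one, raising when the same column name appears with
# two different types.
def unified_schema(dtypes_list):
    merged = {}
    for dtypes in dtypes_list:
        for name, typ in dtypes.items():
            if name in merged and merged[name] != typ:
                raise ValueError("column '%s' exists with different types ('%s' and '%s')"
                                 % (name, typ, merged[name]))
            merged[name] = typ
    return merged

print(unified_schema([{"A": "bigint", "B": "bigint"},
                      {"A": "bigint", "C": "bigint", "DNew": "bigint"}]))
# {'A': 'bigint', 'B': 'bigint', 'C': 'bigint', 'DNew': 'bigint'}
```

Every frame is then padded with null columns until it matches this merged mapping, which is what makes the final positional RDD union safe.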

Answer 5 (score: 0):

If you read both dataframes from storage files, you can use a predefined schema:

val schemaForRead = StructType(List(
  StructField("userId", LongType, true),
  StructField("dtEvent", LongType, true),
  StructField("goodsId", LongType, true)
))

val dfA = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file1.parquet")
val dfB = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file2.parquet")
val dfC = dfA.union(dfB)

Note that the schemas in file1 and file2 can differ from each other and from schemaForRead. If a file does not contain a field from schemaForRead, the resulting dataframe will have a null field for it. If a file contains an additional field that is not in schemaForRead, that field will simply not appear in the dataframe.

Answer 6 (score: 0):

A Scala version of this is also answered here - (Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema) -

It takes a list of DataFrames to be merged; same-named columns across all the DataFrames should have the same datatype.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

def unionPro(DFList: List[DataFrame], spark: SparkSession): DataFrame = {

    /**
     * This function accepts DataFrames with the same or a different schema/column order, with some or no common columns.
     * It creates a unioned DataFrame.
     */

    import spark.implicits._

    val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct

    def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
      allCols.toList.map(x => x match {
        case x if myCols.contains(x) => col(x)
        case _                       => lit(null).as(x)
      })
    }

    // Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases

    val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)

    val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)

    DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))

  }

Here is a sample test for it -


    val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
    val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
    unionPro(List(aDF, bDF), spark).show

which outputs -

+----+----+----+-----+
|Name|  ID| Sal|Deptt|
+----+----+----+-----+
|   A|   1|null| null|
|   B|   2|null| null|
|   C|null|   1|   D1|
|   D|null|   2|   D2|
+----+----+----+-----+
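The per-frame expression list that `unionExpr` builds can be mirrored in plain Python to make the mechanics visible (strings stand in for Spark `Column` expressions; no Spark needed):

```python
# For each column in the master list, keep the frame's own column if it is
# present, otherwise substitute a null literal aliased to that column name.
def union_exprs(my_cols, all_cols):
    mine = set(my_cols)
    return [("col(%s)" % c) if c in mine else ("lit(null).as(%s)" % c)
            for c in all_cols]

master = ["Name", "ID", "Sal", "Deptt"]
print(union_exprs(["Name", "ID"], master))
# ['col(Name)', 'col(ID)', 'lit(null).as(Sal)', 'lit(null).as(Deptt)']
```

Because every frame is projected onto the same master column list in the same order, the final positional union lines up correctly.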

Answer 7 (score: 0):

If you are using Spark version 2.3.0 or later, you can use the built-in unionByName function to get the desired output.

Link to the Git repo containing the unionByName code: https://github.com/apache/spark/blame/cee4ecbb16917fa85f02c635925e2687400aa56b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1894
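unionByName matches columns by name rather than by position (and since Spark 3.1 it also accepts allowMissingColumns=True to fill absent columns with nulls). A pure-Python sketch of the by-name matching, with lists of lists standing in for rows:

```python
# Sketch of unionByName semantics: reorder the second frame's values to the
# first frame's column order before concatenating the rows.
def union_by_name(cols_a, rows_a, cols_b, rows_b):
    if set(cols_a) != set(cols_b):
        raise ValueError("both frames must have the same column names "
                         "(unless allowMissingColumns=True, Spark 3.1+)")
    idx = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [[row[i] for i in idx] for row in rows_b]

cols, rows = union_by_name(["name", "age"], [["A", 1]],
                           ["age", "name"], [[2, "B"]])
print(rows)  # [['A', 1], ['B', 2]]
```

This is why unionByName avoids the silent column-misalignment that a plain positional union can produce.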