Question

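To process the data I have, I extract the schema ahead of time, so that when I read the dataset I can supply the schema instead of going through the expensive schema-inference step.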

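To build the schema, I need to merge several different schemas into one final schema, so I have been using the union (++) and distinct methods, but I keep getting an org.apache.spark.sql.AnalysisException: Duplicate column(s) exception.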

For example, say we have a few schemas with the following structure:

import org.apache.spark.sql.types._

val schema1 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) :: Nil
    ), true) :: Nil)

val schema2 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) :: Nil
    ), true) :: Nil)

val schema3 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) ::
    StructField("ii", StringType, true) :: Nil
    ), true) :: Nil)

val final_schema = (schema1 ++ schema2 ++ schema3).distinct

println(final_schema)

Output:

StructType(
    StructField(A,StructType(
         StructField(i,StringType,true)),true), 
    StructField(A,StructType(
        StructField(i,StringType,true),    
        StructField(ii,StringType,true)),true))

I understand that distinct only filters out schema structures that exactly match another one. However, I want the result to look like this:

StructType(
    StructField(A,StructType(
        StructField(i,StringType,true),    
        StructField(ii,StringType,true)),true))

where everything gets "combined" into a single schema. I have sifted through all the methods in the Scala documentation, but I can't seem to find the right one to solve this. Any ideas?

EDIT:

The end goal is to feed final_schema into sqlContext.read.schema and use the read method to read an RDD of JSON strings.

Answer 1

Try something like this:

(schema1 ++ schema2 ++ schema3).groupBy(getKey).map(_._2.head)

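where getKey is a function that maps a schema field to the property you want to merge on (for example the column name, or the names of its subfields). In the map function you can take the head, or use a more elaborate function to decide which schema to keep.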
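Taking the head keeps only the first field of each group, so for the schemas in the question it would drop "ii". Below is a minimal sketch of the "more elaborate function" idea, assuming getKey is simply the field name and that same-named fields have compatible types. SchemaMerge, mergeSchemas and mergeFields are hypothetical helper names, not part of the Spark API.

```scala
import org.apache.spark.sql.types._

object SchemaMerge {
  // Hypothetical helper: merge two schemas by field name, recursing into
  // nested structs so that their sub-fields are unioned.
  def mergeSchemas(left: StructType, right: StructType): StructType = {
    val all = left.fields ++ right.fields
    val names = all.map(_.name).distinct   // keep first-seen field order
    val grouped = all.groupBy(_.name)      // getKey is the field name here
    StructType(names.map(n => grouped(n).reduce((x, y) => mergeFields(x, y))))
  }

  // Assumption: when two same-named fields are not both structs,
  // the first one wins (only nullability is widened).
  def mergeFields(a: StructField, b: StructField): StructField =
    (a.dataType, b.dataType) match {
      case (x: StructType, y: StructType) =>
        StructField(a.name, mergeSchemas(x, y), a.nullable || b.nullable)
      case _ =>
        a.copy(nullable = a.nullable || b.nullable)
    }
}

object MergeExample {
  // Same shapes as schema1 and schema3 in the question.
  val schema1 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) :: Nil), true) :: Nil)
  val schema3 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) ::
    StructField("ii", StringType, true) :: Nil), true) :: Nil)

  // Fold any number of schemas into one.
  val finalSchema: StructType =
    Seq(schema1, schema3).reduce((a, b) => SchemaMerge.mergeSchemas(a, b))
}
```

With these inputs, finalSchema should contain a single field A whose struct holds both i and ii. Type conflicts between non-struct fields are not handled here beyond keeping the first one, so a real version would probably need to decide what to do in that case.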

Answer 2

Using Spark with Scala:

// test1Df and test2Df are existing DataFrames; toSet drops only fields that are exactly equal
val consolidatedSchema = test1Df.schema.++:(test2Df.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)
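One thing to keep in mind: a Set of StructField values deduplicates by equality, so this collapses identical columns but not columns that share a name and differ in their nested type. A small check, using field shapes taken from the question (SetDedupCheck and its member names are made up for illustration):

```scala
import org.apache.spark.sql.types._

object SetDedupCheck {
  // "A" with a single sub-field, as in schema1 of the question.
  val a1 = StructField("A", StructType(
    StructField("i", StringType, true) :: Nil), true)

  // "A" with two sub-fields, as in schema3 of the question.
  val a2 = StructField("A", StructType(
    StructField("i", StringType, true) ::
    StructField("ii", StringType, true) :: Nil), true)

  // Exactly equal fields collapse into one element ...
  val same = Set(a1, a1.copy())

  // ... but same-named fields with different nested types both survive.
  val different = Set(a1, a2)
}
```

So for flat schemas with identical columns this approach is enough, while nested structs like the ones in the question would still end up with two A columns.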

Combining Spark schema without duplicates?
