Question

from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Keep expanding struct columns until none remain at the top level.
    while True:
        flat_cols = [name for name, dtype in nested_df.dtypes if not dtype.startswith('struct')]
        nested_cols = [name for name, dtype in nested_df.dtypes if dtype.startswith('struct')]
        if not nested_cols:
            # Return the current frame; this also covers input that has no structs at all.
            return nested_df
        print(nested_cols)
        # Promote every struct field to a top-level column named parent_child.
        nested_df = nested_df.select(flat_cols +
                                     [col("`"+nc+'`.`'+c+"`").alias((nc+'_'+c).replace(".","_"))
                                      for nc in nested_cols
                                      for c in nested_df.select("`"+nc+'`.*').columns])
df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "GetDocument").load("/FileStore/tables/test.xml")
df1=flatten_df(df)

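This is the code I use to flatten an XML document. Basically, I want to take XML with nested elements and flatten all of it into a single row with no struct data types, so that every value is its own column. The code above works for the test cases I have tried, but when I ran it on a very large XML file it broke after a few rounds of flattening (in the while loop) with the following error: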

'Ambiguous reference to fields StructField(_Id,StringType,true), StructField(_id,StringType,true);'

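I think this is because it is trying to create two separate columns with the same name. How can I avoid this while keeping my code generic for any XML?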
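One possible way to keep the function generic is sketched below. The error message names `_Id` and `_id`, which differ only in case, and Spark resolves field names case-insensitively by default, so setting `spark.sql.caseSensitive` to `true` before flattening should let both fields be selected. The helper `unique_aliases` is a hypothetical name (not part of Spark) and the `doc_` prefix in the example is made up. It adds a numeric suffix to any alias that collides with an earlier one when case is ignored, so the flattened columns stay distinct even if case sensitivity is turned off again later. It is plain Python, so it can be tried without a cluster.

```python
def unique_aliases(names):
    """Return names in order, suffixing any that collide case-insensitively."""
    seen = set()      # lower-cased aliases already handed out
    result = []
    for name in names:
        candidate = name
        n = 1
        # Keep adding a numeric suffix until the alias is unique ignoring case.
        while candidate.lower() in seen:
            candidate = f"{name}_{n}"
            n += 1
        seen.add(candidate.lower())
        result.append(candidate)
    return result


# Assumed usage inside flatten_df (needs an active SparkSession named `spark`):
#   spark.conf.set("spark.sql.caseSensitive", "true")
#   build the list of aliases first, pass it through unique_aliases,
#   then pair each alias with its col(...) expression before calling select.
print(unique_aliases(["doc__Id", "doc__id", "Order"]))
# prints ['doc__Id', 'doc__id_1', 'Order']
```

The suffix scheme is only one choice; any deterministic renaming that ignores case when checking for collisions would serve the same purpose.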

One thing to note is that it is fine for a column to have an array data type; I will explode those arrays into separate rows in a later step.

Updated example

Original DF -

 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

DF after the function -

 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children_id: string (nullable = true)
 |-- children_att: array (nullable = true)
 |   |-- children_att_element_Order: long (nullable = true)
 |   |-- children_att_element_attval: string (nullable = true)
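To make the later explode step more concrete, here is a small sketch of how the `(name, type)` pairs that the function already inspects could be split into three groups instead of two. `split_columns` is a hypothetical helper, and the type strings in the example are hand-written to resemble what `DataFrame.dtypes` reports for the original schema above; they are not output captured from a real run.

```python
def split_columns(dtypes):
    """Group (name, type_string) pairs by their top-level type."""
    groups = {"flat": [], "struct": [], "array": []}
    for name, type_string in dtypes:
        if type_string.startswith("struct"):
            groups["struct"].append(name)
        elif type_string.startswith("array"):
            groups["array"].append(name)
        else:
            groups["flat"].append(name)
    return groups


# Hand-written stand-in for df.dtypes on the original schema (an assumption).
sample_dtypes = [
    ("Order", "bigint"),
    ("attval", "string"),
    ("children", "struct<id:string,att:array<struct<Order:bigint,attval:string>>>"),
]
print(split_columns(sample_dtypes))
# prints {'flat': ['Order', 'attval'], 'struct': ['children'], 'array': []}
```

After one round of flattening, a column such as `children_att` would land in the `array` group, and those names could then be handed to `explode` one at a time.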

Answer 1

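I ran into a similar problem and was able to parse my XML file with the following steps:

Install the following Maven library on Databricks: "com.databricks:spark-xml_2.10:0.4.1"
Upload your file to DBFS using the following path: FileStore > tables > xml > sample_data

Run the following code: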

data = spark.read.format("com.databricks.spark.xml").option("rootTag", "col1").option("rowTag", "col2").option("rowTag", "col3").load("dbfs:/FileStore/tables/sample_data.xml")

display(data)

Flatten XML dataframe in Spark
