Question

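I am creating a DataFrame from an XML file. It holds some information about a dealer, and each dealer has several cars: every car is a child of the cars element and is represented by a value element, and each cars.value element has various car attributes. So I use the explode function to create one row per car, as follows: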

exploded_dealer = df.select('dealer_id',explode('cars.value').alias('a_car'))

Now I want to get the various attributes of cars.value.

I do it like this:

car_details_df = exploded_dealer.select('dealer_id','a_car.attribute1','a_car.attribute2')

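This works fine. But sometimes a cars.value element does not have all of the attributes I specify in the query. For example, some cars.value elements might have only attribute1, and then running the code above fails with the following error: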

pyspark.sql.utils.AnalysisException: u"cannot resolve 'attribute2' given input columns: [dealer_id, attribute1];"

How can I ask Spark to run the same query, but simply return None when attribute2 does not exist?

Update: I read my data as follows:

initial_file_df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='dealer').load('<xml file location>')

exploded_dealer = df.select('financial_data',explode('cars.value').alias('a_car'))

Answer 1

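Since you have already made specific assumptions about the schema, the best thing you can do is to define it explicitly with nullable optional fields and use it when importing the data.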

Let's say you expect documents that look like this:

<rows>
    <row>
        <id>1</id>
        <objects>
            <object>
                <attribute1>...</attribute1>
                ...
                <attributeN>...</attributeN>
            </object>
        </objects>
    </row>
</rows>

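where attribute1, attribute2, ..., attributeN may not be present in a given batch, but you can define a finite set of options and their corresponding types. For simplicity, let's say there are only two options: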

{("attribute1", StringType), ("attribute2", LongType)}

You can define the schema as:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
  StructField("objects", StructType([
    StructField("object", StructType([
      StructField("attribute1", StringType(), True),
      StructField("attribute2", LongType(), True)
    ]), True)
  ]), True),
  StructField("id", LongType(), True)
])

and use it with the reader:

spark.read.schema(schema).option("rowTag", "row").format("xml").load(...)
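When the set of optional attributes is long, the schema can also be generated from the list of (name, type) options instead of being typed out by hand. The following is only a sketch: `struct_ddl` and `attribute_options` are names made up for this illustration, and it assumes a Spark version whose reader accepts a DDL-formatted schema string in place of a `StructType`.

```python
def struct_ddl(fields):
    """Render (name, sql_type) pairs as a STRUCT<...> DDL type string."""
    inner = ", ".join(f"{name}: {sql_type}" for name, sql_type in fields)
    return f"STRUCT<{inner}>"


# The two optional attributes from the example, using Spark SQL type names
# (STRING for StringType, BIGINT for LongType).
attribute_options = [("attribute1", "STRING"), ("attribute2", "BIGINT")]

# Same nesting as the StructType version: objects -> object -> attributes.
schema_ddl = f"objects STRUCT<object: {struct_ddl(attribute_options)}>, id BIGINT"
print(schema_ddl)
```

Under that assumption, `schema_ddl` would be passed to `spark.read.schema(...)` exactly where the `schema` object is used above.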

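Reading with this explicit schema is valid for any subset of the attributes ({∅, {attribute1}, {attribute2}, {attribute1, attribute2}}), and it is also more efficient than relying on schema inference.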
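If supplying a full schema is not practical, a different workaround (not part of the approach above) is to compare the inferred fields with the ones you want and select a typed NULL for anything missing. This is a minimal sketch: `optional_field_exprs` is a hypothetical helper, and it assumes `a_car` was inferred as a struct column.

```python
def optional_field_exprs(struct_col, available, wanted, sql_type="STRING"):
    """Build selectExpr strings: real fields where present, typed NULLs otherwise."""
    exprs = []
    for name in wanted:
        if name in available:
            # The field exists in the inferred struct, so select it directly.
            exprs.append(f"{struct_col}.{name} AS {name}")
        else:
            # The field is absent from the inferred schema, so emit a NULL column.
            exprs.append(f"CAST(NULL AS {sql_type}) AS {name}")
    return exprs


# Example: the inferred struct only contains attribute1.
exprs = optional_field_exprs("a_car", ["attribute1"], ["attribute1", "attribute2"])
print(exprs)
```

With the question's DataFrame, the available names would come from `exploded_dealer.schema["a_car"].dataType.fieldNames()`, and the result would be used as `exploded_dealer.selectExpr("dealer_id", *exprs)`. The NULL column is typed as STRING by default here, which is an assumption; pass another `sql_type` if the real attribute has a different type.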

Selecting a column that does not exist in a DataFrame
