I am trying to flatten a complex XML structure. Here is the XML file:
<root>
<ATS name="exp_Change_Rec">
<EXP1>
<EXP1INT >
<ExPFLDs>
<ExPFLD precision="10" name="COL1" output="true"/>
<ExPFLD precision="20" name="COL2" output="true"/>
<ExPFLD precision="30" name="COL3" output="true"/>
<ExPFLD precision="40" name="COL4" output="true"/>
</ExPFLDs>
</EXP1INT>
</EXP1>
</ATS>
<ATS name="exp_Change_Flag">
<EXP1>
<EXP1INT >
<ExPFLDs>
<ExPFLD precision="10" name="COL5" output="true"/>
<ExPFLD precision="20" name="COL6" output="true"/>
<ExPFLD precision="30" name="COL7" output="true"/>
</ExPFLDs>
</EXP1INT>
</EXP1>
</ATS>
</root>
My expected output is:
Name Value
exp_Change_Rec COL1
exp_Change_Rec COL2
exp_Change_Rec COL3
exp_Change_Rec COL4
exp_Change_Flag COL5
exp_Change_Flag COL6
exp_Change_Flag COL7
I am running this with Databricks spark-xml, but it produces something like a Cartesian join:
import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._
val df1 = spark.read.option("rowTag", "root").xml("file:///home/sv-infopcdq/spark/sample.xml")
val df2 = df1.withColumn("_name", explode($"ATS._name"))
df2.withColumn("COL_NAMES", explode($"ATS.EXP1.EXP1INT.ExPFLDs.ExPFLD")).show(100)
+--------------------+---------------+--------------------+
|                 ATS|          _name|           COL_NAMES|
+--------------------+---------------+--------------------+
|[[[[[[[, COL1, tr...| exp_Change_Rec|[[, COL1, true, 2...|
|[[[[[[[, COL1, tr...| exp_Change_Rec|[[, COL5, true,],...|
|[[[[[[[, COL1, tr...|exp_Change_Flag|[[, COL1, true, 2...|
|[[[[[[[, COL1, tr...|exp_Change_Flag|[[, COL5, true,],...|
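Here both exp_Change_Rec and exp_Change_Flag are emitting COL1. Any suggestions?
When I explode a single column the output is fine, but when I try to explode all of the columns I get a Cartesian join.
For example, I would like the output to be: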
Name Value Precision
exp_Change_Rec COL1 10
exp_Change_Rec COL2 20
exp_Change_Rec COL3 30
exp_Change_Rec COL4 40
exp_Change_Flag COL5 10
exp_Change_Flag COL6 20
exp_Change_Flag COL7 30
If I try to extend the accepted answer to include the precision as well, it does not work:
xml_df.withColumn("_name", ($"_name"))
.withColumn("COL_NAMES",explode($"EXP1.EXP1INT.ExPFLDs.ExPFLD._name"))
.withColumn("COL_NAMES",explode($"EXP1.EXP1INT.ExPFLDs.ExPFLD._precision")).drop("EXP1")
.select($"_name".as("Name"), $"COL_NAMES".as("Value"))
Is there a workaround for exploding multiple columns at the same level?
Answer 0 (score: 2)
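First, you need to correct rootTag and rowTag before going any further. Because you used the parent/root tag (root) as the rowTag, the whole XML file is treated as a single record, which is why you get one block of data instead of separate rows. See the implementation details below.
I used the explode function and selected exactly the columns I wanted, as shown below: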
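import org.apache.spark.sql.functions.explode
import spark.implicits._ // enables the $"..." column syntax

// f is assumed to be a java.io.File pointing at the XML file
val xml_df = spark.read
  .format("com.databricks.spark.xml")
  .option("rootTag", "root")
  .option("rowTag", "ATS") // each <ATS> element becomes one row
  .option("nullValue", "")
  .load(f.getAbsolutePath)
xml_df.show
xml_df.printSchema()
// Explode the array of field names so that each name becomes its own row
val test = xml_df
  .withColumn("COL_NAMES", explode($"EXP1.EXP1INT.ExPFLDs.ExPFLD._name")).drop("EXP1")
  .select($"_name".as("Name"), $"COL_NAMES".as("Value"))
test.printSchema()
test.show(100, false)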
This prints the following, ending with the output you expected:
+--------------------+---------------+
|                EXP1|          _name|
+--------------------+---------------+
|[[[[[, COL1, true...| exp_Change_Rec|
|[[[[[, COL5, true...|exp_Change_Flag|
+--------------------+---------------+
root
|-- EXP1: struct (nullable = true)
| |-- EXP1INT: struct (nullable = true)
| | |-- ExPFLDs: struct (nullable = true)
| | | |-- ExPFLD: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _name: string (nullable = true)
| | | | | |-- _output: boolean (nullable = true)
| | | | | |-- _precision: long (nullable = true)
|-- _name: string (nullable = true)
root
|-- Name: string (nullable = true)
|-- Value: string (nullable = true)
+---------------+-----+
|Name |Value|
+---------------+-----+
|exp_Change_Rec |COL1 |
|exp_Change_Rec |COL2 |
|exp_Change_Rec |COL3 |
|exp_Change_Rec |COL4 |
|exp_Change_Flag|COL5 |
|exp_Change_Flag|COL6 |
|exp_Change_Flag|COL7 |
+---------------+-----+
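For the follow-up about including the precision: since ExPFLD is an array of structs, one option is to explode the whole struct once and then read _name and _precision from the same element, which avoids the second explode that causes the cross product. Below is a minimal sketch of that idea, not a tested drop-in. The case classes (Fld, Flds, Inner, Outer, Rec), the ExplodeStructSketch object and the local SparkSession are hypothetical stand-ins that mimic the schema printed above, so the example runs without the XML file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the schema spark-xml inferred above
case class Fld(_name: String, _output: Boolean, _precision: Long)
case class Flds(ExPFLD: Seq[Fld])
case class Inner(ExPFLDs: Flds)
case class Outer(EXP1INT: Inner)
case class Rec(EXP1: Outer, _name: String)

object ExplodeStructSketch {
  // Explode the array of structs once; each output row holds one ExPFLD element,
  // so its _name and _precision stay paired.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("fld", explode(col("EXP1.EXP1INT.ExPFLDs.ExPFLD")))
      .select(
        col("_name").as("Name"),
        col("fld._name").as("Value"),
        col("fld._precision").as("Precision"))

  // Helper (assumption): builds one ATS-like record from (column name, precision) pairs
  def rec(name: String, cols: (String, Long)*): Rec =
    Rec(Outer(Inner(Flds(cols.map { case (n, p) => Fld(n, true, p) }))), name)

  // Sample data standing in for the DataFrame read with rowTag = "ATS"
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      rec("exp_Change_Rec", "COL1" -> 10L, "COL2" -> 20L, "COL3" -> 30L, "COL4" -> 40L),
      rec("exp_Change_Flag", "COL5" -> 10L, "COL6" -> 20L, "COL7" -> 30L)
    ).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-struct-sketch")
      .getOrCreate()
    flatten(sample(spark)).show(false)
    spark.stop()
  }
}
```

With the real data, passing xml_df to flatten in place of the sample should give one row per ExPFLD element, assuming the schema matches the one printed above.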
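Answer 1 (score: 0)
One way to explode multiple columns together is to zip them first with arrays_zip:
df.select(explode(arrays_zip($"col1", $"col2"))).select($"col.*").show(20, false)
arrays_zip is available from Spark 2.4 onward.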
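To show how this applies to the question, here is a minimal sketch that zips a names array with a precisions array and explodes the zipped result once. The ZipExplodeSketch object, the sample data and the column names names, precisions and z are assumptions made for illustration. With the XML from the question the two arrays would presumably come from $"EXP1.EXP1INT.ExPFLDs.ExPFLD._name" and $"EXP1.EXP1INT.ExPFLDs.ExPFLD._precision", aliased to plain columns first.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

object ZipExplodeSketch {
  // arrays_zip pairs the arrays by position into an array of structs;
  // a single explode then yields one row per pair instead of a cross product.
  def zipExplode(df: DataFrame): DataFrame =
    df.withColumn("z", explode(arrays_zip(col("names"), col("precisions"))))
      .select(col("Name"), col("z.*"))
      // Rename by position so the sketch does not depend on the struct field names
      .toDF("Name", "Value", "Precision")

  // Sample data (assumption): one row per ATS element with two parallel arrays
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("exp_Change_Rec", Seq("COL1", "COL2", "COL3", "COL4"), Seq(10L, 20L, 30L, 40L)),
      ("exp_Change_Flag", Seq("COL5", "COL6", "COL7"), Seq(10L, 20L, 30L))
    ).toDF("Name", "names", "precisions")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("zip-explode-sketch")
      .getOrCreate()
    zipExplode(sample(spark)).show(false)
    spark.stop()
  }
}
```

Note that arrays_zip matches elements by index, so it only gives sensible pairs when the arrays are parallel. If one array is shorter, the missing entries are filled with null.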