I have a DataFrame that was imported from MySQL:
dataframe_mysql.show()
+----+---------+-------------------------------------------------------+
|  id|accountid|                                                xmldata|
+----+---------+-------------------------------------------------------+
|1001|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1002|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1003|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1004|    12347|<AccountSetup xmlns:xsi="test"><Customers test="test...|
+----+---------+-------------------------------------------------------+
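The xmldata column contains XML markup, which I need to parse into structured data in a separate DataFrame.
Previously, I kept the XML in a text file and loaded it into a Spark DataFrame using "com.databricks.spark.xml":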
spark-shell --packages com.databricks:spark-xml_2.10:0.4.1,com.databricks:spark-csv_2.10:1.5.0
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag","Account").load("mypath/Account.xml")
The final structured output:
df.show()
+----------+--------------------+--------------------+--------------+--------------------+-------+....
|   AcctNbr|         AddlParties|           Addresses|ApplicationInd|       Beneficiaries|ClassCd|....
+----------+--------------------+--------------------+--------------+--------------------+-------+....
|AAAAAAAAAA|[[Securities Amer...|[WrappedArray([D,...|             T|[WrappedArray([11...|     35|....
+----------+--------------------+--------------------+--------------+--------------------+-------+....
Please suggest how to achieve the same result when the XML content is held in a DataFrame column.
Answer 0 (score: 0)
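Since you are trying to pull the XML data column out into a separate DataFrame, you can still use the code from the spark-xml package. You just need to use its reader directly.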
case class Data(id: Int, accountid: Int, xmldata: String)
val df = Seq(
Data(1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
Data(1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>"),
Data(1003, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"c\">f</Customers></AccountSetup>")
).toDF
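import com.databricks.spark.xml.XmlReader
// Options are set through methods on the reader; each one returns the reader itself
val reader = new XmlReader().withRowTag("AccountSetup")
// Pull the XML column out as an RDD[String] (relies on spark.implicits._, which spark-shell imports automatically)
val rdd = df.select("xmldata").map(r => r.getString(0)).rdd
// Parse each XML string into a row of a new DataFrame
val xmlDF = reader.xmlRdd(spark.sqlContext, rdd)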
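For comparison with the reader-based code above, here is a minimal sketch of the other route: a UDF with hand-written XML parsing. It assumes Spark 2.x with scala-xml on the classpath. The names parseCustomers and parseCustomersUdf, and the choice to extract only the test attribute and text of the Customers element, are illustrative and not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.util.Try
import scala.xml.XML

// Reuses the existing session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("xml-udf-sketch").getOrCreate()

// Small stand-in for the MySQL-backed DataFrame (same columns as in the question).
val df = spark.createDataFrame(Seq(
  (1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
  (1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>")
)).toDF("id", "accountid", "xmldata")

// Plain parsing function, kept separate from Spark so it can be tested on its own.
// Returns (test attribute, element text) of <Customers>, or None for null or malformed XML.
def parseCustomers(xml: String): Option[(String, String)] =
  Try {
    val customers = XML.loadString(xml) \ "Customers"
    ((customers \ "@test").text, customers.text)
  }.toOption

// Wrap the function as a UDF; the tuple becomes a struct with fields _1 and _2.
val parseCustomersUdf = udf(parseCustomers _)

val parsed = df
  .withColumn("customers", parseCustomersUdf(col("xmldata")))
  .select(col("id"), col("accountid"),
    col("customers._1").as("test"), col("customers._2").as("value"))

parsed.show()
```

Unlike the reader approach, this keeps id and accountid next to the parsed fields, and a row whose XML cannot be parsed ends up with nulls in the extracted columns instead of failing the job.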
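However, in the long run, a UDF with custom XML parsing, as philantrovert suggested, would probably be cleaner.
Reference link for the reader class: here
Answer 1 (score: 0)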
I tried the following code:
val dff1 = Seq(
Data(1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
Data(1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>"),
Data(1003, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"c\">f</Customers></AccountSetup>")
).toDF
dff1.show()
val reader = new XmlReader().withRowTag("AccountSetup")
val xmlrdd = dff1.select("xmldata").map(a => a.getString(0)).rdd
xmlrdd.toDF("newRowXml").show()
val xmldf = reader.xmlRdd(sqlcontext, xmlrdd)
xmldf.show()
I got output from dff1.show() and xmlrdd.toDF("newRowXml").show():
//dff1.show()
+----+---------+--------------------+
|  id|accountid|             xmldata|
+----+---------+--------------------+
|1001|    12345|<AccountSetup xml...|
|1002|    12345|<AccountSetup xml...|
|1003|    12345|<AccountSetup xml...|
+----+---------+--------------------+
xmlrdd.toDF("newRowXml").show()
+--------------------+
|           newRowXml|
+--------------------+
|<AccountSetup xml...|
|<AccountSetup xml...|
|<AccountSetup xml...|
+--------------------+
18/09/20 19:30:29 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
18/09/20 19:30:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/09/20 19:30:29 INFO MemoryStore: MemoryStore cleared
18/09/20 19:30:29 INFO BlockManager: BlockManager stopped
18/09/20 19:30:29 INFO BlockManagerMaster: BlockManagerMaster stopped
18/09/20 19:30:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/09/20 19:30:29 INFO SparkContext: Successfully stopped SparkContext
18/09/20 19:30:29 INFO ShutdownHookManager: Shutdown hook called
18/09/20 19:30:29 INFO ShutdownHookManager: Deleting directory C:\Users\rajkiranu\AppData\Local\Temp\spark-16433b5e-01b7-472b-9b88-fea0a67a991a
Process finished with exit code 1
However, the output of xmldf.show() never appears.