Question

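I am setting up Databricks to compare and contrast data from multiple sources. Some of the data is in CSV files, some is in JSON, and the rest is in Google Earth KML files. The last one is proving to be a real challenge. I am trying to upload the XML data with the data upload feature, but Databricks cannot create a table from the XML string. What is the process for getting XML into a Databricks table?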

Answer 1

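The best approach is to use the spark-xml library in your workspace.

Search for spark-xml in the Maven/Spark Packages section and add it as a library by following the steps at https://docs.databricks.com/user-guide/libraries.html#create-a-library

Then attach the library to your cluster: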

https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster

Finally, use the following code to read the XML data in Databricks:

# rowTag names the element that becomes one row (rootTag only applies when writing XML)
xmldata = spark.read.format('xml').option("rowTag", "note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
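
For the KML files from the question, a read along the same lines might look like the sketch below. It assumes that spark-xml is attached to the cluster, that each feature of interest is a `Placemark` element (typical for Google Earth exports, but worth checking in your own files), and that `spark` is the session a Databricks notebook provides. The helper name `read_kml_placemarks` and the path in the usage comment are made up for illustration.

```python
def read_kml_placemarks(spark, path, row_tag="Placemark"):
    """Read a KML file into a DataFrame with one row per <Placemark>.

    Assumption: the spark-xml library is attached to the cluster, so the
    'xml' data source is available. `spark` is the notebook's SparkSession.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)  # element that spark-xml turns into a row
        .load(path)
    )

# In a notebook (hypothetical path):
# placemarks = read_kml_placemarks(spark, "dbfs:/mnt/mydatafolder/kml/places.kml")
# placemarks.printSchema()
```

Nested elements such as `Point` should come back as struct columns, so `printSchema()` is a good first check before writing the result to a table.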

Alternatively, here is plain Python code that does the same thing with ElementTree and pandas instead of spark-xml:

import re
import xml.etree.ElementTree as ET

import pandas as pd

# storage_mount_name is the DBFS mount path that holds the XML files (defined elsewhere)
xmlfiles = dbutils.fs.ls(storage_mount_name)
# Use local /dbfs paths so that ElementTree can open the files directly
paths = [xf.path.replace('dbfs:', '/dbfs') for xf in xmlfiles]

# Get attribute names (for now, all leaf elements of the first file's XML structure)
root = ET.parse(paths[0]).getroot()
attributes = [node.tag for node in root.iter() if len(node) == 0]
# Strip the '{namespace}' prefix from each tag
clean_attribute_names = [re.sub(r'\{.*\}', '', a) for a in attributes]

# Create a DataFrame with one row per file and one column per leaf element
df = pd.DataFrame(columns=clean_attribute_names, index=paths)
for afile in paths:
    root = ET.parse(afile).getroot()
    # Assumes every file has the same leaf elements as the first one
    df.loc[afile] = [node.text for node in root.iter() if node.tag in attributes]
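
Since the question is about KML specifically, the ElementTree approach can also be narrowed to placemarks. The sketch below is one way to do that, assuming the files use the standard KML 2.2 namespace and that each `Placemark` holds a `Point` with `lon,lat[,alt]` coordinates. The sample document, the function name `placemarks_to_rows`, and the output field names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Standard KML 2.2 namespace; adjust if your files declare a different one
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Tiny made-up document standing in for a real Google Earth export
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Site A</name>
      <Point><coordinates>-122.08,37.42,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Site B</name>
      <Point><coordinates>2.35,48.85</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def placemarks_to_rows(kml_text):
    """Return one dict per point Placemark: name, longitude, latitude."""
    root = ET.fromstring(kml_text)
    rows = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        name = pm.findtext("kml:name", default="", namespaces=KML_NS)
        coords = pm.findtext(
            ".//kml:Point/kml:coordinates", default="", namespaces=KML_NS
        )
        parts = coords.strip().split(",")
        if len(parts) < 2:
            continue  # skip placemarks without point coordinates
        rows.append(
            {
                "name": name,
                "longitude": float(parts[0]),  # KML order is lon,lat[,alt]
                "latitude": float(parts[1]),
            }
        )
    return rows


rows = placemarks_to_rows(SAMPLE_KML)
```

In a notebook, the resulting list of rows could then be passed to `pd.DataFrame(rows)` or `spark.createDataFrame` and saved as a table. Lines and polygons would need their own coordinate handling, which this sketch leaves out.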

