PySpark issue using com.databricks:spark-xml

Date: 2018-11-30 19:17:21

Tags: apache-spark pyspark jupyter-notebook

I'm trying to get an academic POC working that relies on PySpark with com.databricks:spark-xml. The goal is to load the Stack Exchange data dump, which is in XML format (https://archive.org/details/stackexchange), into a PySpark DataFrame.

It works like a charm on well-formed XML with proper tags, but fails on the Stack Exchange dump, which looks like this:

<users>
  <row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>

Depending on which root tag and row tag I specify, I get either an empty schema or... this:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Treat each <users> element as one record
df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _AboutMe: string (nullable = true)
 |    |    |-- _AccountId: long (nullable = true)
 |    |    |-- _CreationDate: string (nullable = true)
 |    |    |-- _DisplayName: string (nullable = true)
 |    |    |-- _DownVotes: long (nullable = true)
 |    |    |-- _Id: long (nullable = true)
 |    |    |-- _LastAccessDate: string (nullable = true)
 |    |    |-- _Location: string (nullable = true)
 |    |    |-- _ProfileImageUrl: string (nullable = true)
 |    |    |-- _Reputation: long (nullable = true)
 |    |    |-- _UpVotes: long (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _Views: long (nullable = true)
 |    |    |-- _WebsiteUrl: string (nullable = true)

+--------------------+
|                 row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
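For what it's worth, I can flatten this nested result with explode (one element of the row array per output row), but the whole file is still parsed as a single record first:

from pyspark.sql.functions import explode

# Each element of the 'row' array becomes one output row
users = df.select(explode('row').alias('user'))
users.select('user._Id', 'user._DisplayName', 'user._Reputation').show()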
Spark          : 1.6.0
Python         : 2.7.15
com.databricks : spark-xml_2.10:0.4.1

I'd greatly appreciate any advice.

Kind regards,

1 Answer:

Answer 0 (score: 1)

I tried the same approach a while ago (running spark-xml on the Stack Overflow dump files) and failed... mostly because the DataFrame was treated as an array of structs, and processing performance was really bad. Instead, I suggest using the standard text reader and mapping the Key="Value" pairs in every line with a UDF, like this:

import re

pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')  # matches attribute pairs such as Id="-1"
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}
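Continuing that sketch, the parsed dictionaries can be turned into a DataFrame like this; the file path is just a placeholder, and sc / sqlContext are assumed to exist as in the question:

from pyspark.sql import Row

# Keep only the '<row .../>' lines of the dump, parse each one into a
# dict of attributes, and build Row objects from those dicts
rows = (sc.textFile('./tmp/test/Users.xml')
          .filter(lambda line: '<row' in line)
          .map(lambda line: Row(**parse_line(line))))

df = sqlContext.createDataFrame(rows)
df.printSchema()

Note that Row(**...) assumes every line carries the same set of attributes; in real dumps optional attributes (e.g. ProfileImageUrl) are often omitted, so specifying an explicit schema, as the notebook linked below does, is more robust.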

You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches the dump from March 2017).
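For illustration only, the sort of casting that conversion involves might look like the following; the column names match the Users attributes above, but the exact types used in the notebook may differ:

from pyspark.sql.functions import col

# All attributes come out of the regex as strings; cast the obvious ones
# (illustrative types, not necessarily the notebook's exact schema)
typed = df.select(
    col('Id').cast('long'),
    col('Reputation').cast('long'),
    col('CreationDate').cast('timestamp'),
    col('DisplayName'),
    col('UpVotes').cast('long'),
    col('DownVotes').cast('long'))
typed.printSchema()

Depending on the Spark version, the ISO timestamps (2014-07-30T18:05:25.020) may need an explicit format instead of a plain cast.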