Spark Python RDD of lxml.Element objects?

Asked: 2017-03-16 20:10:21

Tags: python apache-spark xml-parsing lxml

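I want to do some interactive exploration of a set of XML documents. I'm trying to parse the documents with lxml and query them with the find, findall, and xpath methods. However, PySpark chokes when I try to create an RDD of Element objects: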

from lxml import etree
from lxml.etree import XMLSyntaxError
def get_root(xml):
  xml_bytes = bytes(bytearray(xml, encoding = 'utf-8'))
  try:
    return [etree.XML(xml_bytes)]
  except XMLSyntaxError:
    return []

docs = [
    "<doc><tag name='hoo'>hah</tag><tag name='wah'>zoo</tag></doc>"
  , "<doc><tag name='hoo'>yah</tag><tag name='wah'>woo</tag></doc>"
]
roots = [get_root(x)[0] for x in docs]
roots
  [<Element doc at 0x3b2280>, <Element doc at 0x3b2140>]
docs_rdd = sc.parallelize(docs)
roots_rdd = docs_rdd.flatMap(lambda d: get_root(d))
roots_rdd.count()
  2
roots_rdd.first()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "lxml.etree.pyx", line 1033, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:42268)
    File "lxml.etree.pyx", line 881, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:40855)
    File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode  (src/lxml/lxml.etree.c:12875)
  AssertionError: invalid Element proxy at 62728864

Can someone help me understand what is happening here?

Python 2.7.x or 3.5.x, Spark 1.6.x, and lxml installed with pip or pip3.

Thanks in advance!

1 Answer:

Answer 0: (score: 1)

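lxml objects are not serializable, so they cannot be passed between the executors and the driver, or shuffled. That is why count() succeeds (only integers travel back to the driver) while first() fails (the elements themselves have to be shipped). You can easily reproduce this without Spark: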

from lxml import etree
import pickle

pickle.loads(pickle.dumps(etree.XML("<doc>foo</doc>")))
AssertionError                            Traceback (most recent call last)
...
AssertionError: invalid Element proxy at ...
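
For contrast, the serialized bytes of an element do survive a pickle round trip, so one option is to move XML around as bytes and reparse it where it is needed. The following is a minimal sketch, not part of the original answer; the fallback import of xml.etree.ElementTree is only there so the snippet runs on a machine without lxml, and the variable names are illustrative.

```python
import pickle

try:
    from lxml import etree                    # the parser used in the question
except ImportError:                           # assumption: stdlib fallback, only so the sketch runs without lxml
    import xml.etree.ElementTree as etree

root = etree.XML("<doc><tag name='hoo'>hah</tag></doc>")

# tostring() returns plain bytes, which pickle handles without trouble.
payload = etree.tostring(root)

# Round-trip the bytes through pickle, then rebuild a fresh element from them.
restored = etree.XML(pickle.loads(pickle.dumps(payload)))

print(restored.find("tag").text)
```

The cost of this approach is a reparse on every hop, so it only makes sense when the full tree is really needed downstream.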

You can still use lxml for parsing, as long as what you extract from it are serializable Python objects:

from operator import attrgetter

docs_rdd.flatMap(get_root).flatMap(lambda x: x).map(attrgetter("text")).collect()
['hah', 'zoo', 'yah', 'woo']
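
The same idea can be packaged as a single function that parses and extracts in one step, so that only plain tuples ever leave the worker. This is a sketch under assumptions: extract_tags is a hypothetical helper (not part of lxml or Spark), it targets Python 3, and the fallback import exists only so the snippet runs without lxml installed.

```python
try:
    from lxml import etree                    # the parser used in the question
except ImportError:                           # assumption: stdlib fallback, only so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def extract_tags(xml):
    """Parse one XML string and return picklable (name, text) tuples.

    Hypothetical helper for illustration; returns [] for malformed input.
    """
    try:
        root = etree.fromstring(xml.encode("utf-8"))
    except SyntaxError:
        # lxml's XMLSyntaxError and ElementTree's ParseError both subclass SyntaxError.
        return []
    return [(tag.get("name"), tag.text) for tag in root.findall("tag")]


docs = [
    "<doc><tag name='hoo'>hah</tag><tag name='wah'>zoo</tag></doc>",
    "<doc><tag name='hoo'>yah</tag><tag name='wah'>woo</tag></doc>",
]
pairs = [pair for doc in docs for pair in extract_tags(doc)]
print(pairs)
```

With the docs_rdd from the question, this would presumably be used as docs_rdd.flatMap(extract_tags).collect(), since the function returns a list per document just as get_root does.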