Writing a PySpark RDD or DataFrame into Neo4j

Asked: 2018-04-03 07:58:03

Tags: apache-spark neo4j pyspark rdd

I am running into problems with py2neo and the Spark driver: I cannot insert nodes from inside a foreach or map, as in the code below.

from py2neo import authenticate, Graph, cypher, Node
from pyspark import broadcast

infos = df.rdd

authenticate("localhost:7474", "neo4j", "admin")
graph = Graph(password='admin')
# tx is created on the driver and then captured by the closure passed to foreach
tx = graph.begin()

def node(row):
    query = Node("item", event_id=row[0], text=row[19])
    tx.create(query)

infos.foreach(node)
tx.commit()

Here is the end of the stack trace:

/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in _wrap_function(sc, func, deserializer, serializer, profiler)
   2386     assert serializer, "serializer should not be empty"
   2387     command = (func, profiler, deserializer, serializer)
-> 2388     pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
   2389     return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
   2390                                   sc.pythonVer, broadcast_vars, sc._javaAccumulator)

/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
   2372     # the serialized command will be compressed by broadcast
   2373     ser = CloudPickleSerializer()
-> 2374     pickled_command = ser.dumps(command)
   2375     if len(pickled_command) > (1 << 20):  # 1M
   2376         # The broadcast will have same life cycle as created PythonRDD

/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/serializers.py in dumps(self, obj)
    462 
    463     def dumps(self, obj):
--> 464         return cloudpickle.dumps(obj, 2)
    465 
    466 

/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/cloudpickle.py in dumps(obj, protocol)
    702 
    703     cp = CloudPickler(file,protocol

I suppose the problem is that the tx object cannot be shipped into the loop. We tried to work around it by opening the connection directly inside the loop, as in the code below. It works for small matrices, but when I try it with 20 million rows it stalls at some point.

from py2neo import authenticate, Graph, cypher, Node

infos = df.rdd
authenticate("localhost:7474", "neo4j", "password")

def node(row):
    # one connection and one transaction per row
    graph = Graph(password='admin')
    tx = graph.begin()
    query = Node("item", event_id=row[0], text=row[19])
    tx.create(query)
    tx.commit()

infos.foreach(node)

I did some research on the neo4j-spark-connector; it seems you can add the library, but no example is provided and I am not sure that this functionality is actually exposed in Python at all. What is the best way to solve this?
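(For reference, later versions of the connector, the Neo4j Connector for Apache Spark, expose a DataFrame-based write path that is usable from PySpark. A minimal sketch, assuming that connector is on the Spark classpath, e.g. via --packages, and that the option names below match your connector version; every DataFrame column becomes a node property:

# Sketch only: assumes the Neo4j Connector for Apache Spark is available.
(df.write
   .format("org.neo4j.spark.DataSource")   # the connector's DataSource name
   .mode("Append")
   .option("url", "bolt://localhost:7687")
   .option("authentication.basic.username", "neo4j")
   .option("authentication.basic.password", "admin")
   .option("labels", ":item")              # create nodes with the :item label
   .save())
)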

1 answer:

Answer 0 (score: 0)

The standard pattern for this kind of problem is to use foreachPartition:

from py2neo import Graph, Node

def nodes(rows):
    # one connection and one transaction per partition, not per row
    graph = Graph(password='admin')
    tx = graph.begin()
    for row in rows:
        query = Node("item", event_id=row[0], text=row[19])
        tx.create(query)
    tx.commit()

infos.foreachPartition(nodes)
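
With tens of millions of rows, a single transaction per partition can still grow very large. A variant that commits in fixed-size batches might look like the sketch below (the batch size of 10,000 and the nodes_batched name are assumptions used to illustrate the idea, not part of the original answer):

from py2neo import Graph, Node

BATCH_SIZE = 10000  # assumed value; tune to your dataset and heap size

def nodes_batched(rows):
    graph = Graph(password='admin')
    tx = graph.begin()
    pending = 0
    for row in rows:
        tx.create(Node("item", event_id=row[0], text=row[19]))
        pending += 1
        if pending == BATCH_SIZE:
            tx.commit()          # flush the current batch
            tx = graph.begin()   # start a new transaction
            pending = 0
    if pending > 0:
        tx.commit()              # commit whatever is left

infos.foreachPartition(nodes_batched)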