I am running into a problem with py2neo and the Spark driver: I cannot insert nodes from inside a foreach or map call. The code looks like this:
from py2neo import authenticate, Graph, cypher, Node
from pyspark import broadcast
infos=df.rdd
authenticate("localhost:7474", "neo4j", "admin")
graph = Graph(password='admin')
tx = graph.begin()
def node(row):
    query = Node("item", event_id=row[0], text=row[19])
    tx.create(query)
infos.foreach(node)
tx.commit()
Here is the end of the stack trace:
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in _wrap_function(sc, func, deserializer, serializer, profiler)
2386 assert serializer, "serializer should not be empty"
2387 command = (func, profiler, deserializer, serializer)
-> 2388 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
2389 return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
2390 sc.pythonVer, broadcast_vars, sc._javaAccumulator)
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
2372 # the serialized command will be compressed by broadcast
2373 ser = CloudPickleSerializer()
-> 2374 pickled_command = ser.dumps(command)
2375 if len(pickled_command) > (1 << 20): # 1M
2376 # The broadcast will have same life cycle as created PythonRDD
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/serializers.py in dumps(self, obj)
462
463 def dumps(self, obj):
--> 464 return cloudpickle.dumps(obj, 2)
465
466
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/cloudpickle.py in dumps(obj, protocol)
702
703 cp = CloudPickler(file,protocol
I think the problem is that I cannot pass the tx argument into the loop, because the function (and the open transaction it closes over) has to be pickled and shipped to the workers, which is where the traceback above fails. We tried to work around this by opening the connection directly inside the loop, as in the code below. It works for a small dataset, but when I try it with 20 million rows it stalls at some point.
from py2neo import authenticate, Graph, cypher, Node
infos=df.rdd
authenticate("localhost:7474", "neo4j", "password")
def node(row):
    graph = Graph(password='admin')
    tx = graph.begin()
    query = Node("item", event_id=row[0], text=row[19])
    tx.create(query)
    tx.commit()
infos.foreach(node)
I did some research on the neo4j-spark-connector; it seems you can add the library, but no example is provided, and I am not at all sure that this functionality is actually exposed in Python. What would be the best way to solve this problem?
Answer 0 (score: 0)
The standard pattern for handling this kind of problem is to use foreachPartition, so that you open one connection and one transaction per partition instead of per row:
def nodes(rows):
    graph = Graph(password='admin')
    tx = graph.begin()
    for row in rows:
        query = Node("item", event_id=row[0], text=row[19])
        tx.create(query)
    tx.commit()

infos.foreachPartition(nodes)
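If a single partition is still very large, everything in it ends up in one transaction, which can again become a problem at the 20-million-row scale. A possible refinement, as a minimal sketch (the batch size of 10,000 is an arbitrary assumption to tune for your data, and the py2neo calls are the same ones used above), is to commit in fixed-size batches inside each partition:

from py2neo import Graph, Node

def nodes(rows, batch_size=10000):
    graph = Graph(password='admin')   # one connection per partition
    tx = graph.begin()
    count = 0
    for row in rows:
        tx.create(Node("item", event_id=row[0], text=row[19]))
        count += 1
        if count % batch_size == 0:
            tx.commit()               # flush the current batch
            tx = graph.begin()        # start a fresh transaction
    tx.commit()                       # commit whatever is left

infos.foreachPartition(nodes)

The point of the batching is simply to bound how much uncommitted work a single transaction holds; repartitioning the RDD to control partition size achieves a similar effect.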