Gremlin pagination for large dataset queries

Date: 2018-08-08 07:44:32

Tags: python-3.x neo4j cypher gremlin gremlin-server

I am using Gremlin Server with a large dataset, and I am paginating my Gremlin queries. Here is an example:

def execute_query(query):
    """Submit a Gremlin query string to the server and return the results."""
    ...

# Get the total count first, so we know how many pages to request.
query = """g.V().both().both().count()"""
data = execute_query(query)

page_size = 10000
for x in range(0, int(data[0] / page_size) + 1):
    print(x * page_size, " - ", (x + 1) * page_size)
    query = """g.V().both().both().range({0}, {1})""".format(
        x * page_size, (x + 1) * page_size)
    data = execute_query(query)

The query above works fine, but for pagination I have to know the range at which to stop executing queries. To get that range, I first have to run a count() query and feed the result into the for loop. Is there any other way to paginate with Gremlin?
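One way to drop the upfront count() is to keep requesting pages until a page comes back smaller than the page size, which signals the traversal is exhausted. This is a sketch, not from the original post; `execute_query` here is a hypothetical stand-in for the poster's helper, and the test stub below simply slices a Python list instead of hitting a real server:

```python
def fetch_all_pages(execute_query, page_size=10_000):
    """Yield all results page by page, without an upfront count() query.

    Stops as soon as a page returns fewer than page_size results,
    meaning the traversal has no more data.
    """
    start = 0
    while True:
        query = "g.V().both().both().range({0}, {1})".format(
            start, start + page_size)
        page = execute_query(query)
        yield from page
        if len(page) < page_size:
            break
        start += page_size


def make_fake_executor(data):
    """Test stub: crudely parse 'range(a, b)' out of the query string
    and return the matching slice of an in-memory list."""
    def execute(query):
        inside = query.split("range(")[1].rstrip(")")
        lo, hi = (int(v) for v in inside.split(","))
        return data[lo:hi]
    return execute


# Usage: 25 fake results, page size 10 -> pages of 10, 10, and 5 items.
results = list(fetch_all_pages(make_fake_executor(list(range(25))), page_size=10))
```

Note this still relies on range(), so it does not avoid the server-side re-iteration cost discussed in the answer; it only removes the extra count() round trip.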

- Pagination is needed because fetching 100k results in a single go fails, e.g. g.V().both().both().count()

If we don't use pagination, we get the following error:

ERROR:tornado.application:Uncaught exception, closing connection.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 554, in wrapper
    return callback(*args)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 343, in wrapped
    raise_exc_info(exc)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 314, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 807, in _on_frame_data
    self._receive_frame()
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 697, in _receive_frame
    self.stream.read_bytes(2, self._on_frame_start)
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 312, in read_bytes
    assert isinstance(num_bytes, numbers.Integral)
  File "/usr/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3e1c409ae8>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 604, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 554, in wrapper
    return callback(*args)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 343, in wrapped
    raise_exc_info(exc)
  File "<string>", line 3, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 314, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 807, in _on_frame_data
    self._receive_frame()
  File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 697, in _receive_frame
    self.stream.read_bytes(2, self._on_frame_start)
  File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 312, in read_bytes
    assert isinstance(num_bytes, numbers.Integral)
  File "/usr/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
Traceback (most recent call last):
  File "/home/rgupta/Documents/BitBucket/ecodrone/ecodrone/test2.py", line 59, in <module>
    data = execute_query(query)
  File "/home/rgupta/Documents/BitBucket/ecodrone/ecodrone/test2.py", line 53, in execute_query
    results = future_results.result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/resultset.py", line 81, in cb
    f.result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/connection.py", line 77, in _receive
    self._protocol.data_received(data, self._results)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
    self.data_received(data, results_dict)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
    self.data_received(data, results_dict)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
    self.data_received(data, results_dict)
  File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received

(the line File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received repeats 100 times)

1 Answer:

Answer 0 (score: 1)

This question was largely answered here, but I'll add some more commentary.

Your approach to pagination is really expensive, as I'm not aware of any graph that will optimize that particular traversal, and you're basically iterating all of that data many times over. You do it once for the count(), then you iterate the first 10,000; then for the second 10,000 you iterate the first 10,000 and then the second 10,000; then on the third 10,000 you iterate the first 20,000 and then the third 10,000, and so on...
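To put a rough number on that cost, here is an illustrative calculation (an assumption of this sketch, not a measurement: it assumes each range(a, b) page forces the server to walk the traversal from the start up to b):

```python
# Illustration: total elements the server walks when each range(a, b)
# page re-iterates from the beginning of the traversal up to b.
page_size = 10_000
total = 100_000
pages = total // page_size

# Page k ends at (k + 1) * page_size, so the server walks that many
# elements to serve it; sum over all pages.
touched_with_range = sum((k + 1) * page_size for k in range(pages))

# A single streamed pass walks each element exactly once.
touched_single_pass = total

print(touched_with_range)   # 550000
print(touched_single_pass)  # 100000
```

Under this assumption, paging 100k results in 10k chunks walks 5.5x more elements than one streamed traversal, and the overhead grows quadratically with the number of pages.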

I'm not sure if there is more to your logic, but what you have looks like a form of "batching" to get smaller chunks of results. There is no need to do that, because Gremlin Server already does it for you internally. If you just submit g.V().both().both(), Gremlin Server will batch the results according to the resultIterationBatchSize configuration option.
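For reference, that batching knob is a server-side setting in gremlin-server.yaml; a minimal fragment might look like this (the value shown is illustrative, and I believe 64 is the shipped default):

```yaml
# gremlin-server.yaml (fragment)
# Number of result items the server serializes into each response frame
# when streaming a result set back to the client.
resultIterationBatchSize: 64
```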

Anyway, beyond what's explained in the other question I mentioned, there isn't a better way that I know of to make pagination work.