Question

我正在使用Neo4j / Cypher，我的数据约为 200GB ，所以我想到了可扩展的解决方案“ spark”。

有两种解决方案可用来制作带有spark的neo4j图：

1）Cypher for Apache Spark（CAPS）

2）Neo4j-Spark-Connector

我用了第一个，CAPS。预处理的CSV具有两个“ geohash”信息：每行一个用于拾取，另一个用于下降我想要的是制作一个连接的geohash节点图。

CAPS仅允许通过映射节点制作图形：如果要将ID为0的节点连接到ID为1的节点，则需要具有起始ID为0和结束ID为1的关系。

一个非常简单的布局是：

import numpy as np
import pandas as pd 
from copy import deepcopy
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (16,9)
plt.style.use("ggplot")


data = pd.read_csv("/Users/ericostring/Desktop/OwnProjects/SimpliLearn/xclara.csv")
#predictors
f1 = data["Temperature Difference"].values
#target
f2 = data["Pressure Difference"].values
data.head()


x = np.array(list(zip(f1,f2)))

plt.title("Temperature & Pressure Difference")
plt.xlabel("Temperature Difference")
plt.ylabel("Pressure Difference")
plt.scatter(f1, f2, c="black", s=7)

#euclidian distance calculator
def dist(a, b, ax=1):
    return np.linalg.norm(a-b, axis=ax)

#number of clusters
k = 3

#x-coordinate of random centroids
c_x = np.random.randint(0, np.max(x)-20, size=k)

#y-coordinate of random centroids
c_y = np.random.randint(0, np.max(x)-20, size=k)

c = np.array(list(zip(c_x, c_y)), dtype=np.float32)

print(c)


plt.title("Temperature & Pressure Difference")
plt.xlabel("Temperature Difference")
plt.ylabel("Pressure Difference")
plt.scatter(f1, f2, c="black", s=7)
plt.scatter(c_x, c_y, marker="*", c="g", s=200)


#store the value of the centroids
c_old = np.zeros(c.shape)

#cluster labels (0, 1, 2)
cluster = np.zeros(len(x))

#error function. distance between new centroids and old centroids
error = dist(c, c_old, None)


#loop will run until error becoems zero
while error != 0:

    #assing each value to its closest cluster
    for i in range(len(x)):
        distances = dist(x[i], c)
        #argmin()- Returns the indices of the minimum values along an axis.
        cluster = np.argmin(distances)
        clusters[i]= cluster


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-114-ecc32fffa41f> in <module>()
      7         #argmin()- Returns the indices of the minimum values along an axis.
      8         cluster = np.argmin(distances)
----> 9         clusters[i]= cluster

TypeError: 'numpy.int64' object does not support item assignment

基于我已将CSV加载到Spark Dataframe中的信息，然后将其拆分为：

提取数据帧
删除数据框和
旅行数据框

我已经为两个第一个数据帧生成了一个id，并通过在第三个数据帧中添加列来创建了映射结果是：为每个映射行生成一对节点（pickup- [Trip]-> drop）。

我遇到的问题是：

1）可以为不同的行程重复拾取或下车的风水散布=>我想合并节点的创建

2）一次旅行的下车可能是另一次旅行的皮卡，因此我需要将这两个节点合并为一个

我试图更改图形，但令我惊讶的是火花图形是不可变的=>您无法应用密码查询来对其进行更改。

因此，有一种方法可以使连接，面向和 geohash 带有火花的图？

这是我的代码：

Nodes: (just id, no properties)

id
0
1
2

Relationships: (just the mandatory fields)

id | start | end
0  | 0     | 1
1  | 0     | 2

Answer 1

您是对的，就像Spark一样，CAPS是一个不可变的系统。但是，使用CAPS可以从Cypher语句中创建新图形：https://github.com/opencypher/cypher-for-apache-spark/blob/master/spark-cypher-examples/src/main/scala/org/opencypher/spark/examples/MultipleGraphExample.scala

目前CONSTRUCT子句对MERGE的支持有限。它仅允许将已经绑定的节点添加到新创建的图，而每个绑定节点仅添加一次，而与绑定表中出现的时间无关。

考虑以下查询：

MATCH (n), (m)
CONSTRUCT
  CREATE (n), (m)
RETURN GRAPH

结果图将具有与输入图一样多的节点。

要解决您的问题，您可以使用两种方法：a）在创建图形之前已经进行重复数据删除，b）使用Cypher查询。方法b）如下所示：

// assuming that graph is the graph created at step 11
session.catalog.store("inputGraph", graph)

session.cypher("""
  CATALOG CREATE GRAPH temp {
    FROM GRAPH session.inputGraph
    MATCH (n)
    WITH DISTINCT n.a AS a, n.b as b
    CONSTRUCT 
      CREATE (:HashNode {a: a, b as b})
    RETURN GRAPH
  }
""".stripMargin)

val mergeGraph = session.cypher("""
  FROM GRAPH inputGraph
  MATCH (from)-[via]->(to)
  FROM GRAPH temp
  MATCH (n), (m)
  WHERE from.a = n.a AND from.b = n.b AND to.a = m.a AND to.b = m.b
  CONSTRUCT 
    CREATE (n)-[COPY OF via]->(m)
  RETURN GRAPH
""".stripMargin).graph

注意：将属性名称用于机器人拾取和放置节点（例如a和b）

带Spark的Geohash NEO4j图

1 个答案: