Question

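I set up my topology by following the linked post "Setup spark cluster and titan and cassandra". My topology is as follows:

VMs: 3, each with 8 cores and 16 GB of RAM.

Below is the topology of the VMs and their individual components: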

Note: in the diagram below, "master" is the same as "IP X".

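Note that, according to the linked post, JanusGraph 0.2.0 does not need HDFS, because it uses TinkerPop (TP) 3.2.6, which removed the use of HDFS as intermediate storage.

Now, if my understanding is correct, when I push data into my Cassandra cluster through JanusGraph, we keep the replication factor at 3 so that the data is replicated on all Cassandra nodes and our Spark workers can work on it.

With that in mind, I used the properties file below to push data into JanusGraph backed by a clustered Cassandra + Elasticsearch: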

gremlin.graph=org.janusgraph.core.JanusGraphFactory

storage.backend=cassandrathrift
storage.hostname=IP A, IP B, IP C
storage.cassandra.keyspace=testDev
storage.cassandra.replication-factor=3 

index.search.backend=elasticsearch
index.search.hostname=IP A, IP B, IP C
index.search.elasticsearch.client-only=true

The data was pushed successfully, which I verified with the following checks:

cqlsh shows a keyspace named testDev
Using the same properties file, an OLTP-based g.V().count() returns the correct result.

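Now I want to introduce the Spark graph computer. I tested a running local Spark instance and performed OLAP against Cassandra + Elasticsearch hosted locally on a single VM. It worked well, although it was slow on my test data. But when I bring the Cassandra cluster into the mix, my Spark jobs/tasks simply do not start (as inferred from the UI).

Below is the properties file I use to create a HadoopGraph from the Cassandra backend and run OLAP on it: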

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# JanusGraph Cassandra InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=IP A, IP B, IP C
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=testDev
janusgraphmr.ioformat.cf-name=edgestore

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=testDev
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647

#
# SparkGraphComputer Configuration
#
spark.master=spark://IP X:7077
spark.executor.cores=3
spark.executor.memory=6g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executorEnv.HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
spark.executorEnv.SPARK_CONF_DIR=/home/spark/spark/conf
spark.driverEnv.HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
spark.driverEnv.SPARK_CONF_DIR=/home/spark/spark/conf
spark.driver.extraLibraryPath=/home/hadoop/hadoop/lib/native
spark.executor.extraLibraryPath=/home/hadoop/hadoop/lib/native

gremlin.spark.persistContext=true

# Default Graph Computer
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

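Going by the posts here and here, my configuration (the .properties file) looks fine, but every time I run any OLAP query my job gets lost. I do not see any tasks start in the UI, and after a very long time I get an error stack trace.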
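
One part of "the configuration looks fine" can be checked mechanically: whether the OLAP file points at the same backend, hosts, and keyspace as the OLTP file that wrote the data. The following is only a minimal sketch under that assumption. The helper names (`parse_properties`, `find_mismatches`) and the key pairing in `EXPECTED_MATCHES` are illustrative and not part of JanusGraph; the sample values are copied from the two files above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Assumption: each OLTP key should carry the same value as the listed OLAP keys.
EXPECTED_MATCHES = {
    "storage.backend": ["janusgraphmr.ioformat.conf.storage.backend"],
    "storage.hostname": ["janusgraphmr.ioformat.conf.storage.hostname"],
    "storage.cassandra.keyspace": [
        "janusgraphmr.ioformat.conf.storage.cassandra.keyspace",
        "cassandra.input.keyspace",
    ],
}


def find_mismatches(oltp, olap):
    """Return (oltp_key, olap_key, oltp_value, olap_value) for each disagreement."""
    mismatches = []
    for oltp_key, olap_keys in EXPECTED_MATCHES.items():
        for olap_key in olap_keys:
            left, right = oltp.get(oltp_key), olap.get(olap_key)
            if left != right:
                mismatches.append((oltp_key, olap_key, left, right))
    return mismatches


# Sample values taken from the two property files in this post.
OLTP_SAMPLE = """
storage.backend=cassandrathrift
storage.hostname=IP A, IP B, IP C
storage.cassandra.keyspace=testDev
"""

OLAP_SAMPLE = """
# JanusGraph Cassandra InputFormat configuration
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=IP A, IP B, IP C
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=testDev
cassandra.input.keyspace=testDev
"""

for row in find_mismatches(parse_properties(OLTP_SAMPLE), parse_properties(OLAP_SAMPLE)):
    print("mismatch:", row)
```

For the values shown in this post the loop reports nothing, so a naming mismatch between the two files does not appear to be the cause.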

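I initially thought something was wrong with the way I had set up the Spark standalone cluster, but then I tried running OLAP by reading a graph from the file system.

I used the following properties to read a GraphSON file: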

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=/opt/JanusGraph/0.2.0/data/grateful-dead.json
gremlin.hadoop.outputLocation=output

#
# SparkGraphComputer Configuration
#
spark.master=spark://IP X:7077
spark.executor.cores=2
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
gremlin.spark.persistContext=true

# Default Graph Computer
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

When I load the properties file above as:

graph = GraphFactory.open("/conf.properties")
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()

it returns the expected output, and I can see the Spark job and its stages in the UI. This means the Spark OLAP query ran successfully.

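If that is the case, it looks as though the connection to Spark is established but I am unable to read the data I inserted from the Cassandra nodes. Why would that be?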
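
One way to narrow this down, sketched here under assumptions: run a plain TCP check from each Spark worker against every Cassandra node, to see whether the Thrift port used by `cassandrathrift` is reachable from where the executors run. Port 9160 is Cassandra's default `rpc_port`; the function names are illustrative, and this only tests reachability, not whether the input format can actually read the keyspace.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port=9160, timeout=3.0):
    """Map each host to whether the given port (default: Thrift rpc_port) is reachable."""
    return {host: port_is_open(host, port, timeout) for host in hosts}
```

Calling `check_hosts([...])` on each worker with the three Cassandra addresses from `storage.hostname` would show whether any worker cannot reach a node. If every entry is `True`, the problem is more likely in the job configuration or classpath than in the network.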

Any pointers would be appreciated!

Let me know if more information is needed and I will add it here.

Cheers :-)

JanusGraph + Cassandra & ES cluster as backend + Spark cluster for analytics: topology and topology configuration?

0 answers: