How do I load the Spark Cassandra Connector in the shell?

Time: 2014-09-14 19:57:47

Tags: cassandra apache-spark datastax-enterprise

I am trying to use the Spark Cassandra Connector with Spark 1.1.0.

I have successfully built the jar file from the master branch on GitHub and have gotten the included demos to work. However, when I try to load the jar file into the spark-shell I can't import any of the classes from the com.datastax.spark.connector package.

I have tried using the --jars option on spark-shell and adding the directory with the jar file to Java's CLASSPATH. Neither of these options works. In fact, when I use the --jars option, the logging output shows that the Datastax jar is being loaded, but I still cannot import anything from com.datastax.

I have been able to load the Tuplejump Calliope Cassandra connector into the spark-shell using --jars, so I know that mechanism works. It is only the Datastax connector that is failing for me.

6 Answers:

Answer 0 (score: 28)

I got it. Here is what I did:

$ git clone https://github.com/datastax/spark-cassandra-connector.git
$ cd spark-cassandra-connector
$ sbt/sbt assembly
$ $SPARK_HOME/bin/spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/connector-assembly-1.2.0-SNAPSHOT.jar 

At the scala prompt:

scala> sc.stop
scala> import com.datastax.spark.connector._
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
scala> import org.apache.spark.SparkConf
scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "my cassandra host")
scala> val sc = new SparkContext("spark://spark host:7077", "test", conf)
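At this point the connector classes should resolve against your cluster. As a quick sanity check (the keyspace and table names below are placeholders, not part of the original answer):

scala> val rdd = sc.cassandraTable("test_keyspace", "test_table")
scala> rdd.count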

Answer 1 (score: 19)

Edit: Things are easier now

For detailed instructions check out the project site https://github.com/datastax/spark-cassandra-connector/blob/master/doc/13_spark_shell.md

Or feel free to use Spark Packages to load the library (not all versions are published) http://spark-packages.org/package/datastax/spark-cassandra-connector

> $SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3-s_2.10

The following assumes you are running with OSS Apache C*

You will need to start the shell with --driver-class-path set to include all the connector libs

I'll quote a blog post from the illustrious Amy Tobey:

The easiest way I've found is to set the classpath and then restart the context in the REPL with the necessary classes imported, so that sc.cassandraTable() becomes visible. The newly loaded methods will not show up in tab completion. I don't know why.

  /opt/spark/bin/spark-shell --driver-class-path $(echo /path/to/connector/*.jar |sed 's/ /:/g')
  

It will print a bunch of log information and then present the scala> prompt.

scala> sc.stop
  

Now that the context is stopped, it's time to import the connector.

scala> import com.datastax.spark.connector._
scala> val conf = new SparkConf()
scala> conf.set("cassandra.connection.host", "node1.pc.datastax.com")
scala> val sc = new SparkContext("local[2]", "Cassandra Connector Test", conf)
scala> val table = sc.cassandraTable("keyspace", "table")
scala> table.count

If you are using DSE < 4.5.1

There is a slight issue with the DSE classloader and the previous package naming conventions that will prevent you from finding the new spark-connector libraries. You should be able to work around this by removing the line specifying the DSE classloader from the scripts that start the spark-shell.

Answer 2 (score: 6)

If you want to avoid stopping/starting the context in the shell, you can also add it to your Spark properties in:

{spark_install}/conf/spark-defaults.conf

spark.cassandra.connection.host=192.168.10.10
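With the host in spark-defaults.conf, the connector picks it up from the default context, so after launching spark-shell with the connector jar on the classpath (as in the other answers) you can query Cassandra without stopping and recreating the context. A minimal check, with placeholder keyspace/table names:

scala> import com.datastax.spark.connector._
scala> sc.cassandraTable("my_keyspace", "my_table").count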

Answer 3 (score: 5)

To access Cassandra from the spark-shell, I have built an assembly ("uberjar") containing the cassandra-spark driver with all its dependencies. Provide it to the spark-shell using the --jars option like this:

spark-shell --jars spark-cassandra-assembly-1.0.0-SNAPSHOT-jar-with-dependencies.jar

I was having the same problem described here, and this method is both simple and convenient (instead of loading a long list of dependencies).

I have created a gist with the POM file, which you can download. To create the uberjar with the pom you should do:

mvn package

If you are using sbt, check out the sbt-assembly plugin.
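For reference, a minimal sbt-assembly setup for such an uberjar might look like the sketch below. This is an illustration only, not the gist's POM: the version numbers, Scala version, and merge strategy are assumptions and must match your Spark/Cassandra setup (the plugin itself is declared in project/plugins.sbt, e.g. addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")).

// build.sbt -- assumed sketch of an assembly build containing the connector
name := "spark-cassandra-assembly"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Spark itself is provided by spark-shell, so keep it out of the uberjar
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)

// naive merge strategy: keep the first copy of any duplicate file
assemblyMergeStrategy in assembly := { _ => MergeStrategy.first }

Running sbt assembly then leaves a single jar under target/scala-2.10/ that can be passed to spark-shell --jars.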

Answer 4 (score: 0)

The following steps describe how to set up a server with both a Spark node and a Cassandra node.

Setting up open source Spark

This assumes you already have Cassandra installed.

Step 1: Download and set up Spark

Go to http://spark.apache.org/downloads.html.

a) To keep things simple, we'll use one of the pre-built Spark packages. Choose Spark version 2.0.0, pre-built for Hadoop 2.7, and Direct Download. This will download an archive with the built binaries for Spark.

b) Extract it to a directory of your choice. I'll put mine in ~/apps/spark-1.2

c) Test that Spark is working by opening the shell

Step 2: Test that Spark works

a) cd into the Spark directory and run "./bin/spark-shell". This will open up the Spark interactive shell program.

b) If everything worked, it should display this prompt: "scala>"

Run a simple calculation:

sc.parallelize(1 to 50).sum() which should output 1275.0

c) Congratulations, Spark is working! Exit the Spark shell with the command "exit".

The Spark Cassandra Connector

To connect Spark to a Cassandra cluster, the Cassandra Connector needs to be added to the Spark project. DataStax provides their own Cassandra Connector on GitHub, and we will use that.

  1. Clone the Spark Cassandra Connector repository:

    https://github.com/datastax/spark-cassandra-connector

  2. cd into "spark-cassandra-connector" and build the Spark Cassandra Connector by executing the command

    ./sbt/sbt -Dscala-2.11=true assembly

  3. This should output the compiled jar files into a directory named "target". There will be two jar files, one for Scala and one for Java. The jar we are interested in is "spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar", the one for Scala. Move the jar file into an easy-to-find directory: I put mine in ~/apps/spark-1.2/jars

    Load the connector into the Spark shell:

    Start the shell with this command:

    ./bin/spark-shell --jars ~/apps/spark-1.2/jars/spark-cassandra-connector-assembly-1.1.1-SNAPSHOT.jar

    Connect the Spark Context to the Cassandra cluster and stop the default context:

    sc.stop

    Import the necessary classes:

    import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
    

    Create a new SparkConf with the Cassandra connection details:

    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")

    Create a new Spark Context:

    val sc = new SparkContext(conf)

    You now have a new SparkContext which is connected to your Cassandra cluster.
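    A quick way to confirm the connection is to read a table (the keyspace and table names here are placeholders; use ones that exist in your cluster):

    val test = sc.cassandraTable("my_keyspace", "my_table")
    test.count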

Answer 5 (score: 0)

Complete Spark-Cassandra-Connector code in Java, using Windows 7/8/10

import com.datastax.driver.core.Session;
import com.datastax.spark.connector.cql.CassandraConnector;
import com.google.common.base.Optional;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;
import spark_conn.Spark_connection;
import java.io.Serializable;
import java.math.BigDecimal;
import java.text.MessageFormat;
import java.util.*;
import static com.datastax.spark.connector.CassandraJavaUtil.*;


public class App implements Serializable
{
    private transient SparkConf conf;

    private App(SparkConf conf) {
        this.conf = conf;
    }

    private void run() {
        JavaSparkContext sc = new JavaSparkContext(conf);
        generateData(sc);
        compute(sc);
        showResults(sc);
        sc.stop();
    }

    private void generateData(JavaSparkContext sc) {
    CassandraConnector connector =   CassandraConnector.apply(sc.getConf());

        // Prepare the schema
   try {
   Session session = connector.openSession();
   session.execute("DROP KEYSPACE IF EXISTS java_api");
   session.execute("CREATE KEYSPACE java_api WITH " +
       "replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
   session.execute("CREATE TABLE java_api.products " +
       "(id INT PRIMARY KEY, name TEXT, parents LIST<INT>)");
   session.execute("CREATE TABLE java_api.sales " +
       "(id UUID PRIMARY KEY, product INT, price DECIMAL)");
   session.execute("CREATE TABLE java_api.summaries " +
       "(product INT PRIMARY KEY, summary DECIMAL)");
   } catch (Exception e) { System.out.println(e); }

        // Prepare the products hierarchy
        List<Product> products = Arrays.asList(
                new Product(0, "All products", Collections.<Integer>emptyList()),
                new Product(1, "Product A", Arrays.asList(0)),
                new Product(4, "Product A1", Arrays.asList(0, 1)),
                new Product(5, "Product A2", Arrays.asList(0, 1)),
                new Product(2, "Product B", Arrays.asList(0)),
                new Product(6, "Product B1", Arrays.asList(0, 2)),
                new Product(7, "Product B2", Arrays.asList(0, 2)),
                new Product(3, "Product C", Arrays.asList(0)),
                new Product(8, "Product C1", Arrays.asList(0, 3)),
                new Product(9, "Product C2", Arrays.asList(0, 3))
        );

   JavaRDD<Product> productsRDD = sc.parallelize(products);
   javaFunctions(productsRDD, Product.class).
   saveToCassandra("java_api", "products");

   JavaRDD<Sale> salesRDD = productsRDD.filter
   (new Function<Product, Boolean>() {
            @Override
            public Boolean call(Product product) throws Exception {
                return product.getParents().size() == 2;
            }
        }).flatMap(new FlatMapFunction<Product, Sale>() {
            @Override
            public Iterable<Sale> call(Product product) throws Exception {
                Random random = new Random();
                List<Sale> sales = new ArrayList<>(1000);
                for (int i = 0; i < 1000; i++) {
                  sales.add(new Sale(UUID.randomUUID(), 
                 product.getId(), BigDecimal.valueOf(random.nextDouble())));
                }
                return sales;
            }
        });

      javaFunctions(salesRDD, Sale.class).saveToCassandra
      ("java_api", "sales");
    }

    private void compute(JavaSparkContext sc) {
        JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
                .cassandraTable("java_api", "products", Product.class)
                .keyBy(new Function<Product, Integer>() {
                    @Override
                    public Integer call(Product product) throws Exception {
                        return product.getId();
                    }
                });

        JavaPairRDD<Integer, Sale> salesRDD = javaFunctions(sc)
                .cassandraTable("java_api", "sales", Sale.class)
                .keyBy(new Function<Sale, Integer>() {
                    @Override
                    public Integer call(Sale sale) throws Exception {
                        return sale.getProduct();
                    }
                });

        JavaPairRDD<Integer, Tuple2<Sale, Product>> joinedRDD = salesRDD.join(productsRDD);

        JavaPairRDD<Integer, BigDecimal> allSalesRDD = joinedRDD.flatMapToPair(new PairFlatMapFunction<Tuple2<Integer, Tuple2<Sale, Product>>, Integer, BigDecimal>() {
            @Override
            public Iterable<Tuple2<Integer, BigDecimal>> call(Tuple2<Integer, Tuple2<Sale, Product>> input) throws Exception {
                Tuple2<Sale, Product> saleWithProduct = input._2();
                List<Tuple2<Integer, BigDecimal>> allSales = new ArrayList<>(saleWithProduct._2().getParents().size() + 1);
                allSales.add(new Tuple2<>(saleWithProduct._1().getProduct(), saleWithProduct._1().getPrice()));
                for (Integer parentProduct : saleWithProduct._2().getParents()) {
                    allSales.add(new Tuple2<>(parentProduct, saleWithProduct._1().getPrice()));
                }
                return allSales;
            }
        });

        JavaRDD<Summary> summariesRDD = allSalesRDD.reduceByKey(new Function2<BigDecimal, BigDecimal, BigDecimal>() {
            @Override
            public BigDecimal call(BigDecimal v1, BigDecimal v2) throws Exception {
                return v1.add(v2);
            }
        }).map(new Function<Tuple2<Integer, BigDecimal>, Summary>() {
            @Override
            public Summary call(Tuple2<Integer, BigDecimal> input) throws Exception {
                return new Summary(input._1(), input._2());
            }
        });

        javaFunctions(summariesRDD, Summary.class).saveToCassandra("java_api", "summaries");
    }

    private void showResults(JavaSparkContext sc) {
        JavaPairRDD<Integer, Summary> summariesRdd = javaFunctions(sc)
                .cassandraTable("java_api", "summaries", Summary.class)
                .keyBy(new Function<Summary, Integer>() {
                    @Override
                    public Integer call(Summary summary) throws Exception {
                        return summary.getProduct();
                    }
                });

        JavaPairRDD<Integer, Product> productsRdd = javaFunctions(sc)
                .cassandraTable("java_api", "products", Product.class)
                .keyBy(new Function<Product, Integer>() {
                    @Override
                    public Integer call(Product product) throws Exception {
                        return product.getId();
                    }
                });

        List<Tuple2<Product, Optional<Summary>>> results = productsRdd.leftOuterJoin(summariesRdd).values().toArray();

        for (Tuple2<Product, Optional<Summary>> result : results) {
            System.out.println(result);
        }
    }

    public static void main(String[] args) {
//        if (args.length != 2) {
//            System.err.println("Syntax: com.datastax.spark.demo.App <Spark Master URL> <Cassandra contact point>");
//            System.exit(1);
//        }

//      SparkConf conf = new SparkConf(true)
//        .set("spark.cassandra.connection.host", "127.0.1.1")
//        .set("spark.cassandra.auth.username", "cassandra")            
//        .set("spark.cassandra.auth.password", "cassandra");

        //SparkContext sc = new SparkContext("spark://127.0.1.1:9045", "test", conf);

        //return ;

        /* try{
            SparkConf conf = new SparkConf(true); 
            conf.setAppName("Spark-Cassandra Integration");
            conf.setMaster("yarn-cluster");
            conf.set("spark.cassandra.connection.host", "192.168.1.200");
            conf.set("spark.cassandra.connection.rpc.port", "9042");
            conf.set("spark.cassandra.connection.timeout_ms", "40000");
            conf.set("spark.cassandra.read.timeout_ms", "200000");
            System.out.println("Hi.......Main Method1111...");
            conf.set("spark.cassandra.auth.username","cassandra");
            conf.set("spark.cassandra.auth.password","cassandra");
            System.out.println("Connected Successful...!\n");
            App app = new App(conf);
            app.run();
       }catch(Exception e){System.out.println(e);}*/

        SparkConf conf = new SparkConf();
        conf.setAppName("Java API demo");
//     conf.setMaster(args[0]);
//        conf.set("spark.cassandra.connection.host", args[1]);
          conf.setMaster("spark://192.168.1.117:7077");
          conf.set("spark.cassandra.connection.host", "192.168.1.200");
          conf.set("spark.cassandra.connection.port", "9042");
          conf.set("spark.ui.port","4040");
          conf.set("spark.cassandra.auth.username","cassandra");
          conf.set("spark.cassandra.auth.password","cassandra");
       App app = new App(conf);
        app.run();
    }

    public static class Product implements Serializable {
        private Integer id;
        private String name;
        private List<Integer> parents;

        public Product() { }

        public Product(Integer id, String name, List<Integer> parents) {
            this.id = id;
            this.name = name;
            this.parents = parents;
        }

        public Integer getId() { return id; }
        public void setId(Integer id) { this.id = id; }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        public List<Integer> getParents() { return parents; }
        public void setParents(List<Integer> parents) { this.parents = parents; }

        @Override
        public String toString() {
            return MessageFormat.format("Product'{'id={0}, name=''{1}'', parents={2}'}'", id, name, parents);
        }
    }

    public static class Sale implements Serializable {
        private UUID id;
        private Integer product;
        private BigDecimal price;

        public Sale() { }

        public Sale(UUID id, Integer product, BigDecimal price) {
            this.id = id;
            this.product = product;
            this.price = price;
        }

        public UUID getId() { return id; }
        public void setId(UUID id) { this.id = id; }

        public Integer getProduct() { return product; }
        public void setProduct(Integer product) { this.product = product; }

        public BigDecimal getPrice() { return price; }
        public void setPrice(BigDecimal price) { this.price = price; }

        @Override
        public String toString() {
            return MessageFormat.format("Sale'{'id={0}, product={1}, price={2}'}'", id, product, price);
        }
    }

    public static class Summary implements Serializable {
        private Integer product;
        private BigDecimal summary;

        public Summary() { }

        public Summary(Integer product, BigDecimal summary) {
            this.product = product;
            this.summary = summary;
        }

        public Integer getProduct() { return product; }
        public void setProduct(Integer product) { this.product = product; }

        public BigDecimal getSummary() { return summary; }
        public void setSummary(BigDecimal summary) { this.summary = summary; }

        @Override
        public String toString() {
            return MessageFormat.format("Summary'{'product={0}, summary={1}'}'", product, summary);
        }
    }
}
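Note: the answer above does not show the build it was compiled with. The static import of com.datastax.spark.connector.CassandraJavaUtil points at the old 1.x Java API, so the connector's Java module has to be on the classpath. As an assumed sketch only (the artifact versions are guesses and must match your Spark version), an sbt build could declare:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0",
  // Java API module that provides CassandraJavaUtil / javaFunctions in 1.x
  "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.1.0"
)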