Save a Java List to a Cassandra table using the Spark context

Asked: 2016-09-13 17:16:52

Tags: java scala apache-spark cassandra

Hi, I am quite new to Spark and Scala, and I am running into a problem saving data to Cassandra. Here is my scenario:

1) I get a list of user-defined objects (say, User objects containing firstName, lastName, etc.) from my Java class into my Scala class. Up to this point everything works: I can access the User objects and print their contents.

2) Now I want to save that usersList into a Cassandra table using the Spark context. I have gone through many examples, but every one of them creates a Seq of hard-coded values with a case class and then saves it to Cassandra. I tried that and it works fine, as shown below:

import scala.collection.JavaConversions._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

import com.datastax.spark.connector._

object SparkCassandra extends App {
    val conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("SparkCassandra")
        // set the Cassandra host address to your local address
        .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    val usersList = Test.getUsers
    usersList.foreach(x => print(x.getFirstName))
    // hard-coded values: this works, but it is not what I need
    val collection = sc.parallelize(Seq(userTable("testName1"), userTable("testName1")))
    collection.saveToCassandra("demo", "user", SomeColumns("name"))
    sc.stop()
}

case class userTable(name: String)

But my requirement is to use the dynamic values coming from usersList rather than hard-coded ones. Is there any way to achieve this?
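For example, the intent is roughly the following (a minimal sketch reusing the userTable case class, the sc context, and Test.getUsers from above; not a verified solution):

import scala.collection.JavaConversions._
import com.datastax.spark.connector._

val usersList = Test.getUsers
// Build the case class instances from the Java list instead of hard-coding them
val rows = sc.parallelize(usersList.map(u => userTable(u.getFirstName)).toSeq)
rows.saveToCassandra("demo", "user", SomeColumns("name"))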

2 Answers:

Answer 0 (score: 0):

If you create an RDD of CassandraRow objects, you can save the result directly without needing to specify columns or a case class. In addition, CassandraRow has a very handy fromMap function, so you can define your rows as Map objects, convert them, and save the result.

Example:

import com.datastax.spark.connector._

val myData = sc.parallelize(
  Seq(
    Map("name" -> "spiffman", "address" -> "127.0.0.1"),
    Map("name" -> "Shabarinath", "address" -> "127.0.0.1")
  )
)

// Column names are taken from the map keys
val cassandraRowData = myData.map(rowMap => CassandraRow.fromMap(rowMap))

cassandraRowData.saveToCassandra("keyspace", "table")
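Applied to the usersList from the question, the same idea might look like the following sketch (the test.stores keyspace/table and column names are assumptions; match them to your actual schema):

import scala.collection.JavaConversions._
import com.datastax.spark.connector._

// One Map per user; the keys are the Cassandra column names
val userMaps = Test.getUsers.map { u =>
  Map("first_name" -> u.getFirstName, "last_name" -> u.getLastName, "city" -> u.getCity)
}

sc.parallelize(userMaps.toSeq)
  .map(rowMap => CassandraRow.fromMap(rowMap))
  .saveToCassandra("test", "stores")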

Answer 1 (score: 0):

Finally, I arrived at a solution that meets my test requirement and works fine, as follows:

My Scala code:

import scala.collection.JavaConversions.asScalaBuffer
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.toNamedColumnRef
import com.datastax.spark.connector.toRDDFunctions

object JavaListInsert {
  // Turn the Java-sourced user list into an RDD of (firstName, lastName, city) tuples
  def randomStores(sc: SparkContext, users: List[User]): RDD[(String, String, String)] = {
    sc.parallelize(users).map { x =>
      val firstName = x.getFirstName
      val lastName = x.getLastName
      val city = x.getCity
      (firstName, lastName, city)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cassandraInsert")
    val sc = new SparkContext(conf)
    val usersList = Test.getUsers.toList
    randomStores(sc, usersList).
      saveToCassandra("test", "stores", SomeColumns("first_name", "last_name", "city"))
    sc.stop()
  }
}
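One caveat worth noting: saveToCassandra writes into an existing table but does not create it. Below is a minimal sketch for creating the keyspace and table through the connector's session (the schema and primary key are assumptions; adjust them to your data model):

import com.datastax.spark.connector.cql.CassandraConnector

// Run once before saving; IF NOT EXISTS makes it safe to re-run
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    "CREATE KEYSPACE IF NOT EXISTS test WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
  session.execute(
    "CREATE TABLE IF NOT EXISTS test.stores " +
    "(first_name text, last_name text, city text, PRIMARY KEY (first_name))")
}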

The Java POJO:

    import java.io.Serializable;
    public class User implements Serializable {
        private static final long serialVersionUID = -187292417543564400L;
        private String firstName;
        private String lastName;
        private String city;

        public String getFirstName() {
            return firstName;
        }

        public void setFirstName(String firstName) {
            this.firstName = firstName;
        }

        public String getLastName() {
            return lastName;
        }

        public void setLastName(String lastName) {
            this.lastName = lastName;
        }

        public String getCity() {
            return city;
        }

        public void setCity(String city) {
            this.city = city;
        }
    }

The Java class that returns the list of users:

import java.util.ArrayList;
import java.util.List;

public class Test {
    public static List<User> getUsers() {
        ArrayList<User> usersList = new ArrayList<User>();
        // Build 100 sample users with generated names
        for (int i = 1; i <= 100; i++) {
            User user = new User();
            user.setFirstName("firstName_" + i);
            user.setLastName("lastName_" + i);
            user.setCity("city_" + i);
            usersList.add(user);
        }
        return usersList;
    }
}