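Unable to run a Spark job from EMR that connects to Cassandra on EC2

Time: 2016-02-22 07:30:46

Tags: java amazon-ec2 apache-spark cassandra emr

I am running a Spark job on an EMR cluster, and the job connects to Cassandra running on EC2.

These are the dependencies I am using for the project: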

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
</dependency>

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.5.0-M1</version>
</dependency>

<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>2.1.6</version>
</dependency>

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>1.5.0-M3</version>
</dependency>

The problem is that if I use cassandra-driver-core 3.0.0, I get the following error:

java.lang.ExceptionInInitializerError
at mobi.vserv.SparkAutomation.DriverTester.doTest(DriverTester.java:28)
at mobi.vserv.SparkAutomation.DriverTester.main(DriverTester.java:16)
Caused by: java.lang.IllegalStateException: Detected Guava issue #1635 which indicates that a version of Guava less than 16.01 is in use.  This introduces codec resolution issues and potentially other incompatibility issues in the driver.  Please upgrade to Guava 16.01 or later.
at com.datastax.driver.core.SanityChecks.checkGuava(SanityChecks.java:62)
at com.datastax.driver.core.SanityChecks.check(SanityChecks.java:36)
at com.datastax.driver.core.Cluster.<clinit>(Cluster.java:67)
... 2 more

I tried including Guava version 19.0.0 as a dependency, but I still cannot get this to work.
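One way to see which Guava actually wins on the cluster is to print where the class was loaded from. The sketch below is illustrative only: `ClassOrigin` and `locate` are hypothetical names, not part of Spark or the DataStax driver, and the two class names in `main` are just assumed to be representative of Guava and the driver. Each class is loaded without being initialized, so the driver's own sanity check is not triggered.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Returns the location a class was loaded from, or a short note if it cannot be determined. */
    public static String locate(String className) {
        try {
            // initialize=false: only load the class, do not run its static initializers.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes loaded by the JVM's bootstrap loader report no code source.
            return source == null ? "bootstrap/unknown" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class names below are assumptions chosen for illustration.
        System.out.println("Guava loaded from:  " + locate("com.google.common.base.Objects"));
        System.out.println("Driver loaded from: " + locate("com.datastax.driver.core.Cluster"));
    }
}
```

If the printed Guava location is a Spark or Hadoop jar rather than the Guava jar bundled with the application, an older Guava is presumably shadowing the newer one.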

When I downgrade to cassandra-driver-core 2.1.6, I get the following error instead:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /EMR PUBLIC IP:9042 (com.datastax.driver.core.TransportException: [/EMR PUBLIC IP:9042] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:223)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:78)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1272)
at com.datastax.driver.core.Cluster.init(Cluster.java:158)
at com.datastax.driver.core.Cluster.connect(Cluster.java:248)

Note that I have tested my code locally and it runs fine there. I have also tried the different dependency combinations listed at https://github.com/datastax/spark-cassandra-connector
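Since the same code works locally, one thing worth ruling out for the `Cannot connect` error is plain network reachability from the EMR nodes to the Cassandra native port. The following is a minimal sketch using only the JDK; `PortCheck` and `canConnect` are hypothetical names, and the host passed on the command line is whatever address the job uses as its contact point.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to localhost when no host is given; 9042 is the port used in the question.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(host + ":9042 reachable = " + canConnect(host, 9042, 5000));
    }
}
```

Running this on an EMR node and getting `false` would suggest a network-level cause (for example firewall rules or the address Cassandra listens on) rather than a driver or dependency problem.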

Code:

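import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class App1 {

    private static Logger logger = LoggerFactory.getLogger(App1.class);

    static SparkConf conf = new SparkConf().setAppName("SparkAutomation").setMaster("yarn-cluster");

    static JavaSparkContext sc = null;
    static {
        sc = new JavaSparkContext(conf);
    }

    public static void main(String[] args) throws Exception {

        // Read the gzipped input files from S3.
        JavaRDD<String> Data = sc.textFile("S3 PATH TO GZ FILE/*.gz");

        // Parse each line into a UserSetGet profile.
        JavaRDD<UserSetGet> usgRDD1 = Data.map(new ConverLineToUSerProfile());

        // collect() brings every record back to the driver.
        List<UserSetGet> t3 = usgRDD1.collect();

        // Iterate with i < size(); i <= size() would index one past the end of the list.
        for (int i = 0; i < t3.size(); i++) {
            try {
                phpcallone php = new phpcallone();
                php.sendRequest(t3.get(i));
            } catch (Exception e) {
                logger.error("This Has reached ====> " + e);
            }
        }
    }
}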




public class phpcallone {

    private static Logger logger = LoggerFactory.getLogger(phpcallone.class);
    static String pid;

    public void sendRequest(UserSetGet usg) throws JSONException, IOException, InterruptedException {

        // Each request opens its own Cassandra connection through UpdateCassandra.
        UpdateCassandra uc = new UpdateCassandra();
        try {
            uc.UpdateCsrd();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}

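import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class UpdateCassandra {

    public void UpdateCsrd() throws ClassNotFoundException {

        // Contact point, native-protocol port, credentials and a 10 s connect timeout,
        // all set on the builder so they apply to the cluster that is built.
        Cluster cluster = Cluster.builder()
                .addContactPoint("PUBLIC IP")
                .withPort(9042)
                .withCredentials("username", "password")
                .withSocketOptions(new SocketOptions().setConnectTimeoutMillis(10000))
                .build();

        try {
            Session session = cluster.connect("dmp");
            session.execute("USE dmp");
            System.out.println("Connection established");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the driver's connections and threads.
            cluster.close();
        }
    }
}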

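1 Answer:

Answer 0 (score: 1)

Assuming you are using EMR 4.1+, you can pass the Guava jar to spark-submit with the --jars option. Then give EMR a configuration file that tells Spark to use the user classpath first.

For example, in a file named setup.json: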

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.userClassPathFirst": "true",
      "spark.executor.userClassPathFirst": "true"
    }
  }
]

You can then pass the option --configurations file://setup.json to the create-cluster AWS CLI command.