Spark/Cassandra: how to fetch data from Cassandra with a query using pyspark

Date: 2018-04-23 08:25:24

Tags: cassandra pyspark

I want to pass the query as a parameter, but it gives the following error.

    url = 'jdbc:cassandra://localhost:9042/tutorialspoint'
    query = 'SELECT * FROM emp LIMIT 10'

    df = spark_sql_context.read.format('jdbc')\
               .option("driver", "com.dbschema.CassandraJdbcDriver")\
               .option("url", url)\
               .option("dbtable", query)\
               .option("numPartitions", 2) \
               .load()

java.sql.SQLException: com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable alternative at input 'select' (SELECT * FROM [select]...)
    at com.dbschema.CassandraPreparedStatement.executeQuery(CassandraPreparedStatement.java:113)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

2 Answers:

Answer 0 (score: 1)

Based on your query, I am offering this solution. It should work for almost all common SELECT queries (not sure about joins). I split your query into parts, as shown below, because that is one way I work with Cassandra. I also changed the driver being used. I hope it works for you!

       # Assumes uppercase SELECT/FROM/WHERE keywords in the query string
       column_names = [c.strip() for c in query.split("SELECT")[1].split("FROM")[0].split(",")]
       print(column_names)

       table_name = query.split("FROM")[1].strip().split(" ")[0]
       print(table_name)

       if "WHERE" in query:
           where_condition = query.split("WHERE")[1].strip()
           print(where_condition)
           df = self.spark_sql_context.read.format("org.apache.spark.sql.cassandra") \
               .load(table=table_name, keyspace=self.__keyspace).select(column_names).where(where_condition)

       else:
           df = self.spark_sql_context.read.format("org.apache.spark.sql.cassandra") \
               .load(table=table_name, keyspace=self.__keyspace).select(column_names)
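
The splitting logic above can be lifted into a standalone helper and exercised without a Spark session. This is a sketch of the same approach; `parse_simple_select` is a hypothetical name, and it assumes uppercase SELECT/FROM/WHERE keywords and no joins:

```python
def parse_simple_select(query):
    """Split a simple CQL SELECT into columns, table, and WHERE clause,
    mirroring the answer's string-splitting approach. Assumes the
    SELECT/FROM/WHERE keywords are uppercase and there are no joins."""
    column_names = [c.strip() for c in
                    query.split("SELECT")[1].split("FROM")[0].split(",")]
    table_name = query.split("FROM")[1].strip().split(" ")[0]
    where_condition = query.split("WHERE")[1].strip() if "WHERE" in query else None
    return column_names, table_name, where_condition
```

The returned pieces can then be fed to `.load(table=..., keyspace=...)`, `.select(...)`, and `.where(...)` exactly as in the answer's code.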

Answer 1 (score: 0)

If you have already integrated Spark and Cassandra, you can access the table like this:

    spark_sql_context.read.format("org.apache.spark.sql.cassandra") \
        .options(table="tablename", keyspace="keyspace name").load()
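
For context, a fuller sketch of reading through the spark-cassandra-connector might look like the following. The helper name, the `localhost` host, and the `emp`/`tutorialspoint` names are assumptions for illustration; `main()` only works where pyspark and the connector package are on the classpath, so it is defined but not called here:

```python
def cassandra_reader_options(table, keyspace):
    # Options expected by the spark-cassandra-connector DataFrame reader
    return {"table": table, "keyspace": keyspace}

def main():
    # Requires pyspark plus the spark-cassandra-connector package;
    # host, table, and keyspace names below are assumptions.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("cassandra-read")
             .config("spark.cassandra.connection.host", "localhost")
             .getOrCreate())
    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(**cassandra_reader_options("emp", "tutorialspoint"))
          .load())
    df.show()
```

Call `main()` in an environment with a running Cassandra node and the connector available.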

Update

In Java, we can execute a specific query as follows:

public static List<Row> selectSectorHourlyCounterTotals(String sectorName) {
    Statement statement =
            new SimpleStatement("select * from tablename where sector_name = '" + sectorName + "' allow filtering");
    ResultSet resultSet = dbSession.execute(statement);
    return resultSet.all();
}

You will need to convert this to Scala/Python.
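
As a rough Python translation of the Java snippet, using the DataStax `cassandra-driver` package: the function names mirror the Java ones, the string concatenation is kept for fidelity (so `sector_name` must be trusted; the driver's `%s` placeholders are the safer option), and `session` is assumed to be an already-connected `cassandra.cluster.Session`:

```python
def sector_totals_query(sector_name):
    # Same CQL as the Java example; plain concatenation, no escaping,
    # so only use this with trusted input.
    return ("select * from tablename where sector_name = '"
            + sector_name + "' allow filtering")

def select_sector_hourly_counter_totals(session, sector_name):
    # session is a connected cassandra.cluster.Session; execute() returns
    # a ResultSet, which list() drains into a list of Row objects.
    return list(session.execute(sector_totals_query(sector_name)))
```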