I am trying to fetch data from Hive and insert the same data into Cassandra using Spark. Surprisingly, I find that only one record gets inserted into Cassandra even though the DataFrame has 4000+ records.
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector.cql.CassandraConnector
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector._
import java.math.BigDecimal

case class sales(wk_nbr: Int,
                 store_nbr: Int,
                 sales_amt: BigDecimal)

object HiveConnector extends App {

  val cassandraConfig = ConfigFactory.load("cassandra.conf")
  println("cassandraConfig loaded = " + cassandraConfig)

  val spark = SparkSession.builder().appName("HiveConnector")
    .config("spark.sql.warehouse.dir", "file:/data/raw/historical/tables")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("mapred.input.dir.recursive", "true")
    .config("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    .config("spark.cassandra.connection.host", "***********")
    .config("spark.cassandra.auth.username", "*****un****")
    .config("spark.cassandra.auth.password", "******pw*****")
    .enableHiveSupport()
    .master("yarn").getOrCreate()

  import spark.implicits._
  val query = "select wk_nbr, store_nbr, sum(sales_amt) as sales_amt from scan where visit_dt between '2018-05-08' and '2018-05-11' group by wk_nbr, store_nbr"
  val resDF = spark.sql(query)
  resDF.persist()

  println("RESDF size = " + resDF.count()) // prints the record count
  resDF.show(2) // prints 2 sample records in the log

  CassandraConnector(spark.sparkContext).withSessionDo { session =>
    session.execute("CREATE TABLE raaspoc.sales_data (wk_nbr INT PRIMARY KEY, store_nbr INT, sales_amt DOUBLE)")
  }

  /*
  None of the following saveToCassandra variants works - each inserts only one record instead of all of them
  */
  resDF.map { x =>
    sales.apply(x.get(0).asInstanceOf[Int], x.get(1).asInstanceOf[Int], x.get(2).asInstanceOf[BigDecimal])
  }.rdd.saveToCassandra("raaspoc", "sales_data") // Not working
  resDF.rdd.saveToCassandra("raaspoc", "sales_data") // Not working
  resDF.write.format("org.apache.spark.sql.cassandra").options(Map("table" -> "sales_data", "keyspace" -> "raaspoc")).save() // Not working
  resDF.write.cassandraFormat("sales_data", "raaspoc").save() // Not working

  /*
  When the data frame is written to HDFS, I see all 4000+ records in sales.csv
  */
  resDF.write.format("csv").save("hdfs:/dev/test/sales.csv")
  println("RESDF size after write to cassandra = " + resDF.count()) // prints 4732 (record count)

  spark.close()
}
I don't see any errors in the logs, and the spark-submit completes without any error, yet only one record is inserted. Below is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.test.raas</groupId>
  <artifactId>RaasDataPipelines</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.version>2.11.0</scala.version>
    <spark.version>2.2.0</spark.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector_2.11</artifactId>
      <version>2.0.7</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-tools</groupId>
      <artifactId>maven-scala-plugin</artifactId>
      <version>2.11</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.0.0</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>com.datastax.cassandra</groupId>
      <artifactId>cassandra-driver-core</artifactId>
      <version>3.5.0</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>com.datastax.cassandra</groupId>
      <artifactId>cassandra-driver-mapping</artifactId>
      <version>3.5.0</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>com.datastax.cassandra</groupId>
      <artifactId>cassandra-driver-extras</artifactId>
      <version>3.5.0</version>
      <scope>compile</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <minimizeJar>true</minimizeJar>
          <shadedArtifactAttached>true</shadedArtifactAttached>
          <shadedClassifierName>fat</shadedClassifierName>
          <relocations>
            <relocation>
              <pattern>com.google</pattern>
              <shadedPattern>shaded.guava</shadedPattern>
              <includes>
                <include>com.google.**</include>
              </includes>
              <excludes>
                <exclude>com.google.common.base.Optional</exclude>
                <exclude>com.google.common.base.Absent</exclude>
                <exclude>com.google.common.base.Present</exclude>
              </excludes>
            </relocation>
          </relocations>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
Answer 0 (score: 2)
Is it possible that your primary key (wk_nbr) has the same value across all 4000+ rows?
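If it does, this behavior is expected: Cassandra treats every insert as an upsert, so rows that share a primary-key value silently overwrite one another, and the 4000+ writes collapse into a single row without any error being reported. A quick check against the resDF from the question (a sketch, assuming the column is named wk_nbr as in your case class):

// If this prints 1, every row maps to the same Cassandra primary key,
// so each write simply overwrites the previous one.
println("distinct wk_nbr values = " + resDF.select("wk_nbr").distinct().count())

If wk_nbr alone really is not unique, one possible fix (a sketch, assuming a (wk_nbr, store_nbr) pair uniquely identifies a row in your data) is to include store_nbr in the primary key when creating the table:

CassandraConnector(spark.sparkContext).withSessionDo { session =>
  // wk_nbr becomes the partition key and store_nbr a clustering column,
  // so each (wk_nbr, store_nbr) combination is stored as its own row.
  session.execute("CREATE TABLE raaspoc.sales_data (wk_nbr INT, store_nbr INT, sales_amt DOUBLE, PRIMARY KEY (wk_nbr, store_nbr))")
}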