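I am working on a Cassandra Spark job. I need to find specific users that meet certain criteria, perform a math operation on a specific column, and then save the result back to Cassandra.
For example, given the following dataset, I want to perform a math operation on the age column when certain criteria are met.
Keyspace: test_users Table: member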
CREATE TABLE test_users.member (
    member_id bigint PRIMARY KEY,
    manually_entered boolean,
    member_age decimal,
    member_name text
)
member_id | manually_entered | member_age | member_name
-----------+------------------+------------+------------------
2 | False | 25.544 | Larry Smith
3 | False | 38.3214 | Karen Dinglebop
7 | True | 10 | Howard Jibble
9 | True | 10 | Whitney Howard
4 | True | 60 | Walter White
10 | True | 10 | Kevin Schmoggins
8 | False | 10.234 | Brett Darrel
5 | False | 19.22 | Kenny Loggins
6 | True | 10 | Joe Dirt
1 | False | 56.232 | Joe Schmoe
I am trying to figure out how to use round() from org.apache.spark.sql.functions.
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.expressions.Window
import spark.implicits._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.LocalDate
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.{round}
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SQLContext
val members = spark.
read.
format("org.apache.spark.sql.cassandra").
options(Map( "table" -> "member", "keyspace" -> "test_users" )).
load()
var member_birthdays = members.select("member_id", "manually_entered", "member_age").
where("manually_entered = false and member_age % 1 <> 0").
withColumn("member_age", round(members['member_age'] * 5))
member_birthdays.write.
format("org.apache.spark.sql.cassandra").
mode("Append").
options(Map( "table" -> "member", "keyspace" -> "test_users" )).
save()
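I can't figure out how to perform the math operation with round() and update the specific field, for the rows that meet the criteria, in Spark Cassandra.
Any insight would be greatly appreciated.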
Answer 0 (score: 0)
I updated the org.apache.spark.sql.functions import and used col("member_age") instead of members['member_age']. With that change I was able to update the column values and save them successfully.
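For illustration, here is a minimal sketch of the change this answer describes: referencing the column through col("member_age") rather than members['member_age']. The bracket-and-single-quote form is Python syntax and does not compile in Scala, where single quotes delimit Char literals. Several things in the sketch are assumptions rather than part of the original post: the in-memory sample rows stand in for the Cassandra read so the transformation can be tried without a cluster, the names session, memberBirthdays, saveMembers and roundedAges are hypothetical, and the trailing-dot chaining is kept only so the lines can be pasted into spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, round}

// Reuses the active session (e.g. inside spark-shell) or starts a local one.
val session = SparkSession.builder().
  master("local[*]").
  appName("member-age-sketch").
  getOrCreate()

// Assumption: in-memory copy of the sample rows, replacing the Cassandra read.
val members = session.createDataFrame(Seq(
  (1L, false, BigDecimal("56.232"), "Joe Schmoe"),
  (2L, false, BigDecimal("25.544"), "Larry Smith"),
  (3L, false, BigDecimal("38.3214"), "Karen Dinglebop"),
  (4L, true, BigDecimal("60"), "Walter White"),
  (5L, false, BigDecimal("19.22"), "Kenny Loggins"),
  (6L, true, BigDecimal("10"), "Joe Dirt"),
  (7L, true, BigDecimal("10"), "Howard Jibble"),
  (8L, false, BigDecimal("10.234"), "Brett Darrel"),
  (9L, true, BigDecimal("10"), "Whitney Howard"),
  (10L, true, BigDecimal("10"), "Kevin Schmoggins")
)).toDF("member_id", "manually_entered", "member_age", "member_name")

// Keep rows that were not manually entered and have a fractional age,
// then replace member_age with round(member_age * 5).
val memberBirthdays = members.
  select("member_id", "manually_entered", "member_age").
  where("manually_entered = false and member_age % 1 <> 0").
  withColumn("member_age", round(col("member_age") * 5))

// Hypothetical helper: appends the rows to test_users.member.
// Calling it needs the connector on the classpath and a reachable cluster.
def saveMembers(df: DataFrame): Unit =
  df.write.
    format("org.apache.spark.sql.cassandra").
    mode("append").
    options(Map("keyspace" -> "test_users", "table" -> "member")).
    save()

// Collect the result as member_id -> rounded age for inspection.
val roundedAges: Map[Long, Long] = memberBirthdays.collect().map { r =>
  r.getAs[Long]("member_id") -> r.getAs[java.math.BigDecimal]("member_age").longValue
}.toMap
```

On the sample rows, only member_id 1, 2, 3, 5 and 8 pass the filter, and their rounded values should be 281, 128, 192, 96 and 51 respectively. Because member_id is the primary key, calling saveMembers(memberBirthdays) against a real cluster would be expected to overwrite member_age for those existing rows rather than create new ones.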