I am trying to run a Spark job (talking to Cassandra) that reads data, does some aggregation, and then writes the aggregates back to Cassandra.
1) Is this an optimal algorithm, or is there a better way to do it? I sense that something is not right, since I am not seeing any speedup: it looks like a Cassandra client is created and closed for every Spark action (collect), and it takes a long time to process even a small dataset.
2) The Spark workers are not co-located with Cassandra, i.e., the Spark workers run on different nodes (containers) than the C* nodes (we may move the Spark workers onto the C* nodes for data locality).
3) I see that a Spark job is created/submitted for every Spark action (collect). I believe this is Spark's expected behavior, but is there any way to reduce the reads from C* and use a join so that the data comes back quickly?
4) What are the downsides of this algorithm? Can you recommend a better design approach, i.e., w/r/t the partitioning strategy, loading C* partitions into Spark partitions, and the memory requirements for the executors/driver?
5) Once the algorithm and design approach are sound, I can play with Spark tuning. I am using 5 workers (each with 16 CPUs and 64 GB RAM).
CREATE TABLE analytics.monthly_active_users (
    month text,
    app_id uuid,
    user_id uuid,
    PRIMARY KEY (month, app_id, user_id)
) WITH CLUSTERING ORDER BY (app_id ASC, user_id ASC);
cqlsh:analytics> select * from monthly_active_users limit 2;
month | app_id | user_id
--------+--------------------------------------+--------------------------------------
2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1
2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 2c70a31a-031c-4dbf-8dbd-e2ce7bdc2bc7
CREATE TABLE analytics.daily_user_metric_aggregates (
    metric_date timestamp,
    user_id uuid,
    metric_name text,
    "count" counter,
    PRIMARY KEY (metric_date, user_id, metric_name)
) WITH CLUSTERING ORDER BY (user_id ASC, metric_name ASC);
cqlsh:analytics> select * from daily_user_metric_aggregates where metric_date='2015-02-08' and user_id=199c0a31-8e74-46d9-9b3c-04f67d58b4d1;
metric_date | user_id | metric_name | count
--------------------------+--------------------------------------+-------------------+-------
2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | md | 1
2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | ms | 1
import java.net.InetAddress
import java.util.concurrent.atomic.AtomicLong
import java.util.{Date, UUID}

import com.datastax.spark.connector.util.Logging
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.{DateTime, DateTimeZone}

import scala.collection.mutable.ListBuffer

object MonthlyActiveUserAggregate extends App with Logging {
  val KeySpace: String = "analytics"
  val MauTable: String = "mau"

  val CassandraHostProperty = "CASSANDRA_HOST"
  val CassandraDefaultHost = "127.0.0.1"
  val CassandraHost = InetAddress.getByName(sys.env.getOrElse(CassandraHostProperty, CassandraDefaultHost))

  val conf = new SparkConf().setAppName(getClass.getSimpleName)
    .set("spark.cassandra.connection.host", CassandraHost.getHostAddress)
  lazy val sc = new SparkContext(conf)

  import com.datastax.spark.connector._

  def now = new DateTime(DateTimeZone.UTC)

  // Partition key for monthly_active_users, e.g. "2015-2" (not zero-padded).
  val metricMonth = now.getYear + "-" + now.getMonthOfYear

  // Zero-padded current date, e.g. "2015-02-08".
  private val mauMonthSB: StringBuilder = new StringBuilder
  mauMonthSB.append(now.getYear).append("-")
  if (now.getMonthOfYear < 10) mauMonthSB.append("0")
  mauMonthSB.append(now.getMonthOfYear).append("-")
  if (now.getDayOfMonth < 10) mauMonthSB.append("0")
  mauMonthSB.append(now.getDayOfMonth)
  private val mauMonth: String = mauMonthSB.toString()

  // All zero-padded dates of the current month, e.g. "2015-02-01".."2015-02-28".
  val dates = ListBuffer[String]()
  for (day <- 1 to now.dayOfMonth().getMaximumValue) {
    val metricDate: StringBuilder = new StringBuilder
    metricDate.append(now.getYear).append("-")
    if (now.getMonthOfYear < 10) metricDate.append("0")
    metricDate.append(now.getMonthOfYear).append("-")
    if (day < 10) metricDate.append("0")
    metricDate.append(day)
    dates += metricDate.toString()
  }

  private val metricName: List[String] = List("ms", "md")

  // month -> (app_id -> active user count)
  val appMauAggregate = scala.collection.mutable.Map[String, scala.collection.mutable.Map[UUID, AtomicLong]]()

  case class MAURecord(month: String, appId: UUID, userId: UUID) extends Serializable
  case class DUMARecord(metricDate: Date, userId: UUID, metricName: String) extends Serializable
  case class MAUAggregate(month: String, appId: UUID, total: Long) extends Serializable

  private val mau = sc.cassandraTable[MAURecord]("analytics", "monthly_active_users")
    .where("month = ?", metricMonth)
    .collect()

  mau.foreach { monthlyActiveUser =>
    // One Cassandra lookup (and one Spark job) per user.
    val duma = sc.cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
      .where("metric_date in ? and user_id = ? and metric_name in ?", dates, monthlyActiveUser.userId, metricName)
      //.map(_.userId).distinct().collect()
      .collect()
    if (duma.length > 0) { // if the user has `ms` for the given month
      if (!appMauAggregate.isDefinedAt(mauMonth)) {
        appMauAggregate += (mauMonth -> scala.collection.mutable.Map[UUID, AtomicLong]())
      }
      val monthMap: scala.collection.mutable.Map[UUID, AtomicLong] = appMauAggregate(mauMonth)
      if (!monthMap.isDefinedAt(monthlyActiveUser.appId)) {
        monthMap += (monthlyActiveUser.appId -> new AtomicLong(0))
      }
      monthMap(monthlyActiveUser.appId).incrementAndGet()
    } else {
      println(s"No message_sent in daily_user_metric_aggregates for user: $monthlyActiveUser")
    }
  }

  // Write one aggregate row per (month, app_id) back to Cassandra.
  for ((metricMonth: String, appMauCounts: scala.collection.mutable.Map[UUID, AtomicLong]) <- appMauAggregate) {
    for ((appId: UUID, total: AtomicLong) <- appMauCounts) {
      println(s"month: $metricMonth, app_id: $appId, total: $total")
      val collection = sc.parallelize(Seq(MAUAggregate(metricMonth.substring(0, 7), appId, total.get())))
      collection.saveToCassandra(KeySpace, MauTable, SomeColumns("month", "app_id", "total"))
    }
  }

  sc.stop()
}
Thanks.
Answer 0 (score: 1)
Your solution is the least efficient one possible. You are performing a join by looking up each key one by one, which prevents any possible parallelization.
I have never used the Cassandra connector, but I understand that it returns RDDs, so you can do something like this:
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

val mau: RDD[(UUID, MAURecord)] = sc
  .cassandraTable[MAURecord]("analytics", "monthly_active_users")
  .where("month = ?", metricMonth)
  .map(u => u.userId -> u) // Key by user ID.

val duma: RDD[(UUID, DUMARecord)] = sc
  .cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
  .where("metric_date in ? and metric_name in ?", dates, metricName)
  .map(a => a.userId -> a) // Key by user ID.

// Count "duma" rows per user. (countByKey would return a local Map on the
// driver, so reduce on the cluster instead to keep everything in RDDs.)
val dumaCounts: RDD[(UUID, Long)] = duma
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)

// Join with "mau". This drops "mau" entries that have no count
// and "duma" entries that are not present in "mau".
val joined: RDD[(UUID, (MAURecord, Long))] = mau.join(dumaCounts)

// Get per-application counts of active users.
val appCounts: RDD[(UUID, Long)] = joined
  .map { case (_, (record, _)) => record.appId -> 1L }
  .reduceByKey(_ + _)
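To finish the pipeline, the result can be written back to Cassandra with a single bulk save rather than one parallelize/saveToCassandra call per row as in the original code. A minimal sketch, reusing MAUAggregate and mauMonth from the question's code and assuming the analytics.mau table has the columns (month, app_id, total):

// Sketch only: `MAUAggregate` and `mauMonth` come from the question's code,
// and `analytics.mau` with columns (month, app_id, total) is assumed to exist.
val results: RDD[MAUAggregate] = appCounts.map { case (appId, total) =>
  MAUAggregate(mauMonth.substring(0, 7), appId, total)
}
results.saveToCassandra("analytics", "mau", SomeColumns("month", "app_id", "total"))

A single distributed save lets the connector batch writes per partition instead of launching one tiny Spark job per aggregate row.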
Answer 1 (score: 1)
There is a parameter, spark.cassandra.connection.keep_alive_ms, which controls how long a connection is kept open. Take a look at the documentation page.
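For example, the setting can be applied on the SparkConf before the SparkContext is created. A sketch; the 30-second value here is illustrative, not a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

// Keep Cassandra connections alive between Spark actions instead of
// opening and closing a client around each one (value is illustrative).
val conf = new SparkConf()
  .setAppName("MonthlyActiveUserAggregate")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.connection.keep_alive_ms", "30000")
val sc = new SparkContext(conf)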
If you co-locate the Spark workers with the Cassandra nodes, the connector will take advantage of that and create partitions accordingly, so that an executor always fetches data from the local node.
There are some design improvements you could make to the DUMA table: metric_date does not seem to be the best choice for the partition key. Consider making (user_id, metric_name) the partition key, since in that case you would not have to generate dates for the query; you would just put user_id and metric_name into the where clause. Additionally, you could add a month identifier to the primary key, so that each partition contains only the information relevant to what you want to fetch with each query. A sketch of such a schema follows.
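Here is what that redesign might look like in CQL; the table name and the exact key layout are assumptions based on the suggestion above:

-- Hypothetical redesign: partition by (user_id, metric_name) and cluster by a
-- month bucket plus the date, so one query per user/metric hits one partition.
CREATE TABLE analytics.daily_user_metric_aggregates_v2 (
    user_id uuid,
    metric_name text,
    metric_month text,      -- e.g. '2015-02'
    metric_date timestamp,
    "count" counter,
    PRIMARY KEY ((user_id, metric_name), metric_month, metric_date)
);

-- The per-user query then needs no generated date list:
-- SELECT * FROM analytics.daily_user_metric_aggregates_v2
--   WHERE user_id = ? AND metric_name = ? AND metric_month = '2015-02';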
In any case, join functionality for the Spark-Cassandra-Connector is currently being implemented (see this ticket); a sketch of how it could be used follows.
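For reference, that feature later shipped (in spark-cassandra-connector 1.2) as joinWithCassandraTable. A sketch assuming such a connector version and the hypothetical daily_user_metric_aggregates_v2 table above:

import com.datastax.spark.connector._

// Assumes connector 1.2+ and the hypothetical v2 table whose partition key
// is (user_id, metric_name). Each key is looked up server-side, in parallel.
case class UserMetricKey(userId: UUID, metricName: String)

val keys = sc.parallelize(for {
  u <- userIds            // userIds: Seq[UUID], hypothetical, gathered from monthly_active_users
  m <- Seq("ms", "md")
} yield UserMetricKey(u, m))

val rows = keys.joinWithCassandraTable("analytics", "daily_user_metric_aggregates_v2")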