Spark connector: partition usage and performance issues

Date: 2015-02-11 20:45:58

Tags: cassandra apache-spark datastax-enterprise datastax

I am trying to run a Spark job (talking to Cassandra) that reads data, does some aggregation, and then writes the aggregates back to Cassandra.

  • I have 2 tables (monthly_active_users (MAU) and daily_user_metric_aggregates (DUMA))
  • For every record in MAU there will be one or more records in DUMA
  • Fetch all records in MAU, take each record's user_id, and look up that user's records in DUMA (with a server-side filter applied: metric_name in ('ms', 'md'))
  • If there are one or more records in DUMA matching that where clause, I need to increment the count in the appMauAggregate map (app-wise MAU counts)
  • I tested this algorithm and it works as expected, but I would like to find out:

1) Is this an optimal algorithm, or is there a better way to do it? Something feels wrong to me: I am not seeing any speedup, and it looks like a Cassandra client is created and closed for every Spark action (collect). It takes a long time to process a small dataset.

2) The Spark workers are not co-located with Cassandra, meaning the Spark workers run on different nodes (containers) than the C* nodes (we may move the Spark workers onto the C* nodes to get data locality).

3) I see that a Spark job is created/submitted for every Spark action (collect). I believe this is Spark's expected behavior, but is there any way to reduce the reads from C* and use joins so that the data comes back quickly?

4) What are the drawbacks of this algorithm? Can you recommend a better design approach, i.e. w/r/t partitioning strategy, loading C* partitions into Spark partitions, and memory requirements for the executors/driver?

5) Once the algorithm and design approach are sound, I can play with Spark tuning. I am using 5 workers (each with 16 CPUs and 64 GB RAM).

C* schema:

MAU:

CREATE TABLE analytics.monthly_active_users ( 
    month text, 
    app_id uuid,
    user_id uuid, 
    PRIMARY KEY (month, app_id, user_id) 
) WITH CLUSTERING ORDER BY (app_id ASC, user_id ASC)

Data:

cqlsh:analytics> select * from monthly_active_users limit 2;

 month  | app_id                               | user_id
--------+--------------------------------------+--------------------------------------
 2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1
 2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 2c70a31a-031c-4dbf-8dbd-e2ce7bdc2bc7

DUMA:

CREATE TABLE analytics.daily_user_metric_aggregates ( 
    metric_date timestamp, 
    user_id uuid,
    metric_name text, 
    "count" counter, 
    PRIMARY KEY (metric_date, user_id, metric_name)
) WITH CLUSTERING ORDER BY (user_id ASC, metric_name ASC) 

Data:

cqlsh:analytics> select * from daily_user_metric_aggregates where metric_date='2015-02-08' and user_id=199c0a31-8e74-46d9-9b3c-04f67d58b4d1;

 metric_date | user_id                              | metric_name | count
-------------+--------------------------------------+-------------+-------
  2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | md          |     1
  2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | ms          |     1

Spark Job:

import java.net.InetAddress 
import java.util.concurrent.atomic.AtomicLong 
import java.util.{Date, UUID} 

import com.datastax.spark.connector.util.Logging 
import org.apache.spark.{SparkConf, SparkContext} 
import org.joda.time.{DateTime, DateTimeZone} 

import scala.collection.mutable.ListBuffer 

object MonthlyActiveUserAggregate extends App with Logging { 

    val KeySpace: String = "analytics" 
    val MauTable: String = "mau" 

    val CassandraHostProperty = "CASSANDRA_HOST" 
    val CassandraDefaultHost = "127.0.0.1" 
    val CassandraHost = InetAddress.getByName(sys.env.getOrElse(CassandraHostProperty, CassandraDefaultHost)) 

    val conf = new SparkConf().setAppName(getClass.getSimpleName) 
        .set("spark.cassandra.connection.host", CassandraHost.getHostAddress) 

    lazy val sc = new SparkContext(conf) 
    import com.datastax.spark.connector._ 

    def now = new DateTime(DateTimeZone.UTC) 
    val metricMonth = now.getYear + "-" + now.getMonthOfYear // e.g. "2015-2", matching the month partition key 

    // Builds the current date as a zero-padded "yyyy-MM-dd" string; despite its
    // name, mauMonth holds a full date (substring(0, 7) below extracts the month).
    private val mauMonthSB: StringBuilder = new StringBuilder 
    mauMonthSB.append(now.getYear).append("-") 
    if (now.getMonthOfYear < 10) mauMonthSB.append("0") 
    mauMonthSB.append(now.getMonthOfYear).append("-") 
    if (now.getDayOfMonth < 10) mauMonthSB.append("0") 
    mauMonthSB.append(now.getDayOfMonth) 

    private val mauMonth: String = mauMonthSB.toString() 

    // All days of the current month, zero-padded, for the IN clause on metric_date. 
    val dates = ListBuffer[String]() 
    for (day <- 1 to now.dayOfMonth().getMaximumValue) { 
        val metricDate: StringBuilder = new StringBuilder 
        metricDate.append(now.getYear).append("-") 
        if (now.getMonthOfYear < 10) metricDate.append("0") 
        metricDate.append(now.getMonthOfYear).append("-") 
        if (day < 10) metricDate.append("0") 
        metricDate.append(day) 
        dates += metricDate.toString() 
    } 

    private val metricName: List[String] = List("ms", "md") 
    val appMauAggregate = scala.collection.mutable.Map[String, scala.collection.mutable.Map[UUID, AtomicLong]]() 

    case class MAURecord(month: String, appId: UUID, userId: UUID) extends Serializable 
    case class DUMARecord(metricDate: Date, userId: UUID, metricName: String) extends Serializable 
    case class MAUAggregate(month: String, appId: UUID, total: Long) extends Serializable 

    // Pulls all of this month's MAU records down to the driver in a single action. 
    private val mau = sc.cassandraTable[MAURecord]("analytics", "monthly_active_users") 
        .where("month = ?", metricMonth) 
        .collect() 

    mau.foreach { monthlyActiveUser => 
        // One separate Cassandra lookup (and Spark action) per MAU record. 
        val duma = sc.cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates") 
            .where("metric_date in ? and user_id = ? and metric_name in ?", dates, monthlyActiveUser.userId, metricName) 
            //.map(_.userId).distinct().collect() 
            .collect() 

        if (duma.length > 0) { // the user has at least one 'ms' or 'md' record this month 
            if (!appMauAggregate.isDefinedAt(mauMonth)) { 
                appMauAggregate += (mauMonth -> scala.collection.mutable.Map[UUID, AtomicLong]()) 
            } 
            val monthMap: scala.collection.mutable.Map[UUID, AtomicLong] = appMauAggregate(mauMonth) 
            if (!monthMap.isDefinedAt(monthlyActiveUser.appId)) { 
                monthMap += (monthlyActiveUser.appId -> new AtomicLong(0)) 
            } 
            monthMap(monthlyActiveUser.appId).incrementAndGet() 
        } else { 
            println(s"No message_sent in daily_user_metric_aggregates for user: $monthlyActiveUser") 
        } 

    } 
    // One small RDD and one save per (month, app_id) pair. 
    for ((metricMonth: String, appMauCounts: scala.collection.mutable.Map[UUID, AtomicLong]) <- appMauAggregate) { 
        for ((appId: UUID, total: AtomicLong) <- appMauCounts) { 
            println(s"month: $metricMonth, app_id: $appId, total: $total"); 
            val collection = sc.parallelize(Seq(MAUAggregate(metricMonth.substring(0, 7), appId, total.get()))) 
            collection.saveToCassandra(KeySpace, MauTable, SomeColumns("month", "app_id", "total")) 
        } 
    } 
    sc.stop() 
}

Thanks.

2 answers:

Answer 0 (score: 1)

Your solution is as inefficient as it gets. You are performing the join by looking up each key one by one, which rules out any possible parallelism.

I have never used the Cassandra connector, but I understand it returns RDDs, so you could do something like this:

import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark < 1.3)
import org.apache.spark.rdd.RDD

val mau: RDD[(UUID, MAURecord)] = sc
    .cassandraTable[MAURecord]("analytics", "monthly_active_users")
    .where("month = ?", metricMonth)
    .map(u => u.userId -> u)  // Key by user ID.
val duma: RDD[(UUID, DUMARecord)] = sc
    .cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
    .where("metric_date in ? and metric_name in ?", dates, metricName)
    .map(a => a.userId -> a)  // Key by user ID.
// Count "duma" records per key. (countByKey would return a local Map to the
// driver; reduceByKey keeps the counts distributed as an RDD.)
val dumaCounts: RDD[(UUID, Long)] = duma.mapValues(_ => 1L).reduceByKey(_ + _)
// Join to "mau". This drops "mau" entries that have no count
// and "duma" entries that are not present in "mau".
val joined: RDD[(UUID, (MAURecord, Long))] = mau.join(dumaCounts)
// Get per-application counts.
val appCounts: RDD[(UUID, Long)] = joined
    .map { case (userId, (mauRecord, _)) => mauRecord.appId -> 1L }
    .reduceByKey(_ + _)
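
From there you could also write the per-app totals back to Cassandra in a single distributed action, rather than one parallelize/save per application as in the question. A minimal sketch, assuming the MAUAggregate case class and the analytics.mau table from the question:

// Sketch: one distributed write for all (app_id, total) pairs.
appCounts
    .map { case (appId, total) => MAUAggregate(metricMonth, appId, total) }
    .saveToCassandra("analytics", "mau", SomeColumns("month", "app_id", "total"))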

Answer 1 (score: 1)

  1. There is a parameter, spark.cassandra.connection.keep_alive_ms, that controls how long the connection is kept open. Take a look at the documentation page. (A configuration sketch follows this list.)

  2. If you co-locate the Spark workers with the Cassandra nodes, the connector will take advantage of that and create partitions accordingly, so that an executor always fetches its data from the local node.

  3. There are some design improvements you could make to the DUMA table: metric_date does not seem to be the best choice for a partition key - consider making (user_id, metric_name) the partition key, since then you would not have to generate dates for the query - you would just put user_id and metric_name into the where clause. Moreover, you could add a month identifier to the primary key - then each partition would contain only the information relevant to a single query. (A schema sketch follows this list as well.)

    In any case, join functionality for the Spark-Cassandra-Connector is currently being implemented (see this ticket).
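
To make points 1 and 3 concrete, here is a minimal sketch rather than a definitive implementation; the keep-alive value, the daily_user_metric_aggregates_v2 table name, and the someUserId placeholder are illustrative assumptions:

// Point 1 (sketch): keep connections open across consecutive Spark actions by
// raising the keep-alive window; the 60-second value is only an example.
val tunedConf = new SparkConf()
    .setAppName("MonthlyActiveUserAggregate")
    .set("spark.cassandra.connection.host", CassandraHost.getHostAddress)
    .set("spark.cassandra.connection.keep_alive_ms", "60000")

// Point 3 (sketch): a hypothetical DUMA layout keyed by (user_id, metric_name),
// with a month column so each partition holds exactly one query's worth of data:
//
//   CREATE TABLE analytics.daily_user_metric_aggregates_v2 (
//       user_id uuid,
//       metric_name text,
//       month text,
//       metric_date timestamp,
//       "count" counter,
//       PRIMARY KEY ((user_id, metric_name), month, metric_date)
//   );
//
// A per-user lookup then needs no generated list of dates:
val dumaForUser = sc
    .cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates_v2")
    .where("user_id = ? and metric_name in ? and month = ?", someUserId, metricName, "2015-02")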