Using the spark-cassandra connector

Asked: 2015-07-02 18:37:28

Tags: cassandra apache-spark spark-streaming spark-cassandra-connector

I have a use case where I need to continuously listen to a Kafka topic and, from a Spark Streaming application, write to one of 2000 column families (about 15 columns each, time-series data) depending on a column value. I have a local Cassandra installation on a CentOS VM with 3 cores and 12 GB of RAM; just creating those column families took about 1.5 hours. In my Spark Streaming application I do some pre-processing before storing the stream events in Cassandra, and that is where I am running into a timing problem.

When I try to save 300 events into multiple column families (around 200-250 of them, selected by key), my application takes about 10 minutes to save them. This seems odd, because printing the same events to the screen grouped by key takes less than a minute; it is only the save to Cassandra that is slow. For comparison, I had no problem saving 3 million records to Cassandra in under 3 minutes, but that was to a single column family.

My requirement is to be as close to real time as possible, and this is nowhere near that. The production environment will receive roughly 400 events every 3 seconds.

Do I need to tune anything in Cassandra's YAML file, or make any changes to the cassandra-connector itself?
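Independent of any YAML tuning, part of the latency follows directly from the write pattern described above. A minimal, Spark-free Python sketch (event shapes and table names are hypothetical, for illustration only) of why routing one small batch to 200+ column families is expensive: each group becomes its own table write, so a 300-event micro-batch fans out into hundreds of separate save calls instead of one bulk write.

```python
from collections import defaultdict

# Hypothetical micro-batch: 300 events, each routed to one of ~200
# column families by a key column (names are illustrative only).
events = [(f"cf_{i % 200}", {"ts": i, "value": i * 1.5}) for i in range(300)]

# Group events by target column family, as the streaming app must do
# before writing: each group becomes its own write call to Cassandra.
by_table = defaultdict(list)
for table, row in events:
    by_table[table].append(row)

# One micro-batch now fans out into ~200 distinct table writes, each
# paying per-table connection/metadata overhead, instead of one bulk save.
print(len(by_table))                                  # distinct write targets
print(max(len(rows) for rows in by_table.values()))   # rows per busiest target
```

This also explains the single-column-family comparison above: 3 million rows into one table is one large sequential write path, while 300 rows into 200+ tables is dominated by per-table overhead.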

INFO  05:25:14 system_traces.events                      0,0
WARN  05:25:14 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:14 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:16 ParNew GC in 340ms.  CMS Old Gen: 1308020680 -> 1454559048; Par Eden Space: 251658240 -> 0; 
WARN  05:25:16 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:16 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:17 ParNew GC in 370ms.  CMS Old Gen: 1498825040 -> 1669094840; Par Eden Space: 251658240 -> 0; 
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:19 ParNew GC in 382ms.  CMS Old Gen: 1714792864 -> 1875460032; Par Eden Space: 251658240 -> 0; 

2 Answers:

Answer 0 (score: 1)

I suspect you're hitting an edge case in Cassandra related to the very large number of CFs/columns defined in your schema. Normally, when you see tombstone warnings, it's because you've messed up your data model. Here, however, they are in the system tables, so you've clearly done something to those tables that the authors didn't anticipate (a great many tables, and probably dropping/recreating them).

Those warnings were added because scanning past tombstones looking for live columns creates memory pressure, which triggers GC, which causes pauses, which causes slowness.

Can you consolidate your data into significantly fewer column families? You may also want to try purging the tombstones (drop gc_grace_seconds to zero for the affected tables, run a major compaction on system if that's allowed?, then restore the default).
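A sketch of that purge sequence, with placeholder names (`my_ks` / `my_cf`); note that system tables generally cannot be ALTERed, so the gc_grace_seconds step applies only to your own tables, and 864000 (10 days) is the usual default being restored:

```cql
-- Placeholder keyspace/table; adapt to your schema.
ALTER TABLE my_ks.my_cf WITH gc_grace_seconds = 0;
-- From the shell, force a major compaction to drop the tombstones:
--   nodetool compact my_ks my_cf
ALTER TABLE my_ks.my_cf WITH gc_grace_seconds = 864000;  -- restore default
```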

Answer 1 (score: 0)

You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You could also try another open-source product, SnappyData, a Spark database that can deliver very high performance for a use case like yours.