I installed Cassandra on a single-node AWS instance. I am trying to bulk load a 1-million-row (70 MB) CSV file from HDFS into Cassandra via Spark, but I noticed that Cassandra wrote only 4789 rows. Do I have to change any properties in Spark or Cassandra?
$ hadoop fs -ls par.txt
Picked up _JAVA_OPTIONS: -Xms1024m -Xmx1024m
-rw------- 3 u**** u**** 707531 2015-09-24 10:33 par.txt
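To rule out Spark simply not reading the whole file, a quick check from the same spark-shell session (a minimal sketch; "par.txt" is the same HDFS path as above) would be:

// Sanity check: count the raw lines Spark reads from HDFS for this path.
val rawLines = sc.textFile("par.txt").count()
println(rawLines) // should print ~1,000,000 if the whole file is visible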
Cassandra table:
CREATE TABLE party1 (
  id text,
  country text,
  email text,
  first_name text,
  ip_address text,
  last_name text,
  PRIMARY KEY (id)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.000000 AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
cqlsh:cassdb> select count(*) from party1;

 count
-------
  4789

(1 rows)
Spark command:
import com.datastax.spark.connector._

// Split each CSV line into 6 fields and map them positionally onto the columns below.
val cassing = sc.textFile("par.txt").map(_.split(",")).map(p => (p(0), p(1), p(2), p(3), p(4), p(5)))
cassing.saveToCassandra("cassdb", "party1", SomeColumns("id", "first_name", "last_name", "email", "country", "ip_address"))
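Since the table's PRIMARY KEY is just id, any duplicate ids in the CSV would be silently upserted into a single row rather than rejected, which could explain the shortfall. A minimal sketch to check for that (assuming the id is the first comma-separated field, as in the mapping above):

// Count distinct values of the first CSV column (the id / primary key).
val distinctIds = sc.textFile("par.txt").map(_.split(",")(0)).distinct().count()
println(distinctIds) // if this prints 4789, the missing rows were duplicate-key upserts, not lost writes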