Cassandra writes only 4789 rows when loading via Spark

Asked: 2015-09-28 07:06:44

Tags: hadoop cassandra apache-spark hdfs

I have Cassandra installed on a single-node AWS instance. I am trying to bulk-load a 1-million-row (70 MB) CSV file from HDFS into Cassandra via Spark, but I noticed that Cassandra wrote only 4789 rows. Do I have to change any properties in Spark or Cassandra?

$ hadoop fs -ls par.txt
Picked up _JAVA_OPTIONS: -Xms1024m -Xmx1024m
-rw-------   3 u**** u****     707531 2015-09-24 10:33 par.txt

Cassandra table:

CREATE TABLE party1 (
  id text,
  country text,
  email text,
  first_name text,
  ip_address text,
  last_name text,
  PRIMARY KEY (id)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.000000 AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
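
Note that with PRIMARY KEY (id) and no clustering columns, every write in Cassandra is an upsert: two rows that share the same id silently collapse into one instead of raising an error, which is one plausible way a million input lines could shrink to a few thousand. A minimal sketch of that behaviour (the keyspace is the one from the question, but upsert_demo is a hypothetical scratch table, e.g. CREATE TABLE upsert_demo (id text PRIMARY KEY, val text); assumes a spark-shell with the DataStax connector on the classpath):

import com.datastax.spark.connector._

// Two tuples with the same key: the second write overwrites the first (hypothetical scratch table)
sc.parallelize(Seq(("k1", "first"), ("k1", "second")))
  .saveToCassandra("cassdb", "upsert_demo", SomeColumns("id", "val"))

sc.cassandraTable("cassdb", "upsert_demo").count()  // returns 1, not 2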

cqlsh:cassdb> select count(*) from party1;

 count
-------
  4789

(1 rows)
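
As a cross-check on the cqlsh figure, the table can also be counted from the Spark side through the connector (a sketch; assumes the same spark-shell session as below):

import com.datastax.spark.connector._

// Read the table back through the connector, independent of cqlsh
val written = sc.cassandraTable("cassdb", "party1").count()
println(s"rows visible in party1: $written")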

Spark commands:

import com.datastax.spark.connector._  // required for saveToCassandra / SomeColumns

val cassing = sc.textFile("par.txt").map(line => line.split(",")).map(p => (p(0), p(1), p(2), p(3), p(4), p(5)))

// Tuple elements map to the listed columns positionally: p(0)->id, p(1)->first_name, ..., p(5)->ip_address
cassing.saveToCassandra("cassdb", "party1", SomeColumns("id", "first_name", "last_name", "email", "country", "ip_address"))
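
Since the primary key is just id, a quick way to test whether duplicate keys or short lines explain the shrinkage is to profile the input before writing. A diagnostic sketch, assuming the first CSV field is the intended id:

// Profile par.txt: total lines vs. distinct keys vs. lines too short for p(5)
val lines = sc.textFile("par.txt").cache()
val total       = lines.count()
val distinctIds = lines.map(_.split(",")(0)).distinct().count()
val shortLines  = lines.filter(_.split(",").length < 6).count()  // split(",") drops trailing empty fields
println(s"total=$total distinctIds=$distinctIds shortLines=$shortLines")

If distinctIds comes out near 4789, the upsert behaviour described above accounts for the missing rows; a non-zero shortLines would instead point at failed tasks from the p(5) access.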

0 Answers