I have a dataframe like this:
+-------------------+-------+----------+--------------------+--------------------+
| time| key| sequence| kind| binary|
+-------------------+-------+----------+--------------------+--------------------+
|1528848000000000000|BTC-USD|6066746240| EventA|[0E 42 54 43 2D 5...|
|1528848000000000000|BTC-USD|6066746241| EventB|[0E 42 54 43 2D 5...|
|1528848000000000000|BTC-USD|6066746242| EventC|[0E 42 54 43 2D 5...|
|1528848000001000000|BTC-USD|6066746243| EventB|[0E 42 54 43 2D 5...|
|1528848000001000000|BTC-USD|6066746244| EventA|[0E 42 54 43 2D 5...|
|1528848000003000000|BTC-USD|6066746245| EventA|[0E 42 54 43 2D 5...|
|1528848000003000000|BTC-USD|6066746246| EventC|[0E 42 54 43 2D 5...|
|1528848000003000000|BTC-USD|6066746247| EventA|[0E 42 54 43 2D 5...|
+-------------------+-------+----------+--------------------+--------------------+
This is typical time-series data, sorted by the time column. However, the time column can contain duplicate entries, because multiple events can occur at the same instant.
Now I want to save this data to HBase using saveAsNewAPIHadoopDataset.
Here is what I do in Scala:
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

rdd.map { row =>
  // Row key is the timestamp alone -- not unique when several events share a time
  val put = new Put(Bytes.toBytes(row.time.toString))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.keyColumnBinary, Bytes.toBytes(row.key))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.sequenceColumnBinary, Bytes.toBytes(row.sequence))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.kindColumnBinary, Bytes.toBytes(row.kind))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.binaryColumnBinary, row.binary)
  // The key half of the pair is ignored by TableOutputFormat; the Put carries the row key
  (new ImmutableBytesWritable(), put)
}.saveAsNewAPIHadoopDataset(jobConfig.getConfiguration)
But because the time field is not unique, rows that share a timestamp overwrite each other (they become versions of the same HBase cell), and I lose many of them. So how do people save this kind of dataframe to HBase?
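For context, one direction I am considering is a composite row key that appends a tiebreaker to the timestamp. Below is a minimal sketch, assuming the sequence column is unique per row (which it appears to be in the sample data); all other names mirror the snippet above:

rdd.map { row =>
  // Composite row key: zero-padded timestamp + sequence number.
  // Zero-padding keeps HBase's byte-wise lexicographic order consistent
  // with numeric time order, and the sequence suffix makes each key
  // unique even when multiple events share the same timestamp.
  val rowKey = f"${row.time}%020d-${row.sequence}%020d"
  val put = new Put(Bytes.toBytes(rowKey))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.keyColumnBinary, Bytes.toBytes(row.key))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.sequenceColumnBinary, Bytes.toBytes(row.sequence))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.kindColumnBinary, Bytes.toBytes(row.kind))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.binaryColumnBinary, row.binary)
  (new ImmutableBytesWritable(), put)
}.saveAsNewAPIHadoopDataset(jobConfig.getConfiguration)

With this scheme a time-range scan still works (scan from the padded start time to the padded end time), but I am not sure whether this is the idiomatic approach, or whether people instead rely on HBase cell versions or some other row-key design.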