How do I save a DataFrame with duplicate keys to HBase?

Posted: 2018-08-13 17:22:55

Tags: apache-spark hbase

I have a DataFrame like this:

+-------------------+-------+----------+--------------------+--------------------+
|               time|    key|  sequence|                kind|              binary|
+-------------------+-------+----------+--------------------+--------------------+
|1528848000000000000|BTC-USD|6066746240|              EventA|[0E 42 54 43 2D 5...|
|1528848000000000000|BTC-USD|6066746241|              EventB|[0E 42 54 43 2D 5...|
|1528848000000000000|BTC-USD|6066746242|              EventC|[0E 42 54 43 2D 5...|
|1528848000001000000|BTC-USD|6066746243|              EventB|[0E 42 54 43 2D 5...|
|1528848000001000000|BTC-USD|6066746244|              EventA|[0E 42 54 43 2D 5...|
|1528848000003000000|BTC-USD|6066746245|              EventA|[0E 42 54 43 2D 5...|
|1528848000003000000|BTC-USD|6066746246|              EventC|[0E 42 54 43 2D 5...|
|1528848000003000000|BTC-USD|6066746247|              EventA|[0E 42 54 43 2D 5...|
+-------------------+-------+----------+--------------------+--------------------+

This is typical time-series data, sorted by the time column. The time column, however, can contain duplicate entries, since multiple events may occur at the same instant.

Now I want to save this data to HBase using saveAsNewAPIHadoopDataset. Here is what I do in Scala:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

rdd.map { row =>
  // The row key is the timestamp alone, so Puts that share a timestamp
  // target the same HBase row.
  val put = new Put(Bytes.toBytes(row.time.toString))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.keyColumnBinary, Bytes.toBytes(row.key))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.sequenceColumnBinary, Bytes.toBytes(row.sequence))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.kindColumnBinary, Bytes.toBytes(row.kind))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.binaryColumnBinary, row.binary)
  (new ImmutableBytesWritable(), put)
}.saveAsNewAPIHadoopDataset(jobConfig.getConfiguration)
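
(For context, jobConfig is not shown in the question; it is presumably a Hadoop Job wired to HBase's TableOutputFormat. A minimal sketch of such a setup, where the table name "messages" and the variable names are assumptions, not taken from the question:)

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

// Hypothetical configuration for the jobConfig used above.
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "messages") // assumed table name
val jobConfig = Job.getInstance(conf)
jobConfig.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])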

But because the time field is not unique, I lose many of the duplicate rows: HBase treats Puts that share a row key as versions of the same row, and with the default of one version per cell only the last write survives. So how do people save this kind of DataFrame to HBase?
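
(One common workaround, sketched below as an illustration rather than a confirmed solution, is to fold a tiebreaker into the row key so duplicate timestamps no longer collide. This sketch assumes the sequence column is unique and that both fields are fixed-width, so the byte-wise key order HBase uses still sorts rows by time:)

// A minimal sketch: append the (assumed unique) sequence number to the
// timestamp so every row key is distinct. Both values are fixed-width
// decimal strings in the sample data, which preserves time ordering.
rdd.map { row =>
  val rowKey = Bytes.toBytes(s"${row.time}-${row.sequence}")
  val put = new Put(rowKey)
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.keyColumnBinary, Bytes.toBytes(row.key))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.sequenceColumnBinary, Bytes.toBytes(row.sequence))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.kindColumnBinary, Bytes.toBytes(row.kind))
  put.addColumn(SequencedMessageRow.messagesColumnFamilyBinary, SequencedMessageRow.binaryColumnBinary, row.binary)
  (new ImmutableBytesWritable(), put)
}.saveAsNewAPIHadoopDataset(jobConfig.getConfiguration)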

0 Answers:

No answers yet.