Question

I have a log file with 30K records that I publish to Kafka and then save into HBase through Spark. Of the 30K records, I can see only 4K in the HBase table.

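I tried saving the stream to MySQL instead, and all records are saved correctly there.
With HBase, however, if I publish a file of 100 records to the Kafka topic, only 36 records end up in the HBase table, and if I publish 30K records, HBase shows only 4K.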
Also, the records (rows) in HBase are not consecutive; they look like 1, 3, 10, 17, and so on.

        final Job newAPIJobConfiguration1 = Job.getInstance(config);
        newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "logs");
        newAPIJobConfiguration1.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);

        HTable hTable = new HTable(config, "country");

        lines.foreachRDD((rdd, time) -> {
            // Get the singleton instance of SparkSession
            SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();

            // Convert RDD[String] to RDD[case class] to DataFrame
            JavaRDD<Log> rowRDD = rdd.map(line -> {
                String[] logLine = line.split(" +");
                Log record = new Log();
                record.setTime((logLine[0]));
                record.setTime_taken((logLine[1]));
                record.setIp(logLine[2]);
                return record;
            });

            saveToHBase(rowRDD, newAPIJobConfiguration1.getConfiguration());
        });

        ssc.start();
        ssc.awaitTermination();
    }

    //6. saveToHBase method - insert data into HBase
    public static void saveToHBase(JavaRDD<Log> rowRDD, Configuration conf) throws IOException {
        // create Key, Value pair to store in HBase
        JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = rowRDD.mapToPair(
            new PairFunction<Log, ImmutableBytesWritable, Put>() {
                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<ImmutableBytesWritable, Put> call(Log row) throws Exception {
                    Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
                    //put.addColumn(Bytes.toBytes("sparkaf"), Bytes.toBytes("message"), Bytes.toBytes(row.getMessage()));
                    put.addImmutable(Bytes.toBytes("time"), Bytes.toBytes("col1"), Bytes.toBytes(row.getTime()));
                    put.addImmutable(Bytes.toBytes("time_taken"), Bytes.toBytes("col2"), Bytes.toBytes(row.getTime_taken()));
                    put.addImmutable(Bytes.toBytes("ip"), Bytes.toBytes("col3"), Bytes.toBytes(row.getIp()));
                    return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
                }
            });

        // save to HBase- Spark built-in API method
        //hbasePuts.saveAsNewAPIHadoopDataset(conf);
        hbasePuts.saveAsNewAPIHadoopDataset(conf);
    }

Answer 1

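Because HBase stores records uniquely by rowkey, you are most likely overwriting records.

You are using the current time in milliseconds as the rowkey, so any record created with the same rowkey overwrites the earlier one.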

Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));

So if 100 Puts are created within 1 millisecond, only 1 of them will show up in HBase, because the same row was overwritten 99 times.

The 4K rowkeys in HBase are probably the 4K unique milliseconds (4 seconds) it took to load the data.
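The overwrite effect can be reproduced without a cluster. The sketch below is only an illustration: a plain `HashMap` stands in for the HBase table (one visible row per rowkey), and the class name, method name, and timestamps are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class RowKeyCollisionDemo {
    /** Writes each record under its timestamp and returns how many distinct rows remain. */
    static int survivingRows(long[] timestampsMillis) {
        // Stand-in for an HBase table: a later write to the same key replaces the earlier one.
        Map<Long, String> table = new HashMap<>();
        for (int i = 0; i < timestampsMillis.length; i++) {
            table.put(timestampsMillis[i], "record-" + i);
        }
        return table.size();
    }

    public static void main(String[] args) {
        long base = 1_500_000_000_000L; // arbitrary fixed timestamp for the demo

        // 100 records all created in the same millisecond.
        long[] sameMillisecond = new long[100];
        Arrays.fill(sameMillisecond, base);
        System.out.println(survivingRows(sameMillisecond)); // prints 1

        // 100 records spread over 36 distinct milliseconds.
        long[] spread = new long[100];
        for (int i = 0; i < spread.length; i++) {
            spread[i] = base + (i % 36);
        }
        System.out.println(survivingRows(spread)); // prints 36
    }
}
```

The second case mirrors the numbers in the question: 100 records go in, but only as many rows remain as there were distinct millisecond values.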

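I would suggest a different rowkey design, one in which every record gets a unique key. Also, as a side note, using monotonically increasing rowkeys in HBase is generally a bad idea, because consecutive writes all land on the same region and create a hotspot.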
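One possible design, sketched here under assumptions rather than taken from the question: prefix the key with a small salt so writes spread across regions, keep the timestamp, and append a random UUID so two records created in the same millisecond never share a key. The class name `RowKeys`, the method `uniqueRowKey`, and the bucket count of 16 are all hypothetical choices for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

class RowKeys {
    // Hypothetical number of salt buckets; tune to the number of regions.
    static final int BUCKETS = 16;

    /** Layout: [1-byte salt][8-byte timestamp][16-byte UUID] = 25 bytes. */
    static byte[] uniqueRowKey(long timestampMillis, UUID id) {
        // The salt is derived from the UUID, so keys are not monotonically increasing.
        byte salt = (byte) Math.floorMod(id.hashCode(), BUCKETS);
        return ByteBuffer.allocate(1 + Long.BYTES + 2 * Long.BYTES)
                .put(salt)
                .putLong(timestampMillis)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Two records created in the same millisecond still get different keys.
        byte[] first = uniqueRowKey(now, UUID.randomUUID());
        byte[] second = uniqueRowKey(now, UUID.randomUUID());
        System.out.println(first.length);
        System.out.println(Arrays.equals(first, second));
    }
}
```

In the question's `saveToHBase`, the result would replace the timestamp-only key, for example `new Put(RowKeys.uniqueRowKey(System.currentTimeMillis(), UUID.randomUUID()))`. The trade-off is that salted keys no longer scan in time order; if time-range scans matter, a key built from a natural identifier in the log line may be a better fit.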
