Question

My HBase row key is user + "-" + timestamp, where user is the user's name. The same user can have multiple row keys with different timestamps.

Use case: for each user in a list of users, select the HBase records between user + "-" + timestamp_start and user + "-" + timestamp_end.

timestamp_start < timestamp_end

The current implementation works, but it processes the users serially:

from pyspark import SparkContext

users = [user1, user2, ....]  # 30 million users
sc = SparkContext()
conf = dict()

for user in users:
    # some config params go here
    # scan only this user's key range; the stop row is exclusive
    conf["hbase.mapreduce.scan.row.start"] = user + "-" + str(timestamp_start)
    conf["hbase.mapreduce.scan.row.stop"] = user + "-" + str(timestamp_end)

    # keyConv and valueConv are converter class names defined elsewhere
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)  # the same conf dict created above

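Is there a way to parallelize the HBase scan over the list of users, specifying a separate start row and stop row for each user? A full table scan takes too long to process.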
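
One possible direction, sketched here under several assumptions rather than offered as a tested solution: distribute the user list itself with sc.parallelize and let each partition run its own bounded scans. This swaps TableInputFormat for the HBase Thrift gateway accessed through the happybase client, so it assumes a Thrift server is running and happybase is installed on every executor. The helper names, the host "thrift-host", the table name "my_table", and the partition count are all hypothetical placeholders. Range scans also rely on the timestamps being fixed-width strings, since row keys are compared lexicographically.

```python
from functools import partial


def row_range(user, ts_start, ts_end, sep="-"):
    """Return the (start, stop) row keys for one user, encoded as bytes."""
    return (f"{user}{sep}{ts_start}".encode("utf-8"),
            f"{user}{sep}{ts_end}".encode("utf-8"))


def scan_users(users, ts_start, ts_end, open_table):
    """Run one bounded scan per user over a single shared table handle.

    `open_table` is a zero-argument callable returning an object with a
    happybase-style scan(row_start=..., row_stop=...) method.
    """
    table = open_table()  # one handle per partition, not one per user
    for user in users:
        start, stop = row_range(user, ts_start, ts_end)
        # The stop row is exclusive, as in a regular HBase scan.
        for key, data in table.scan(row_start=start, row_stop=stop):
            yield key, data


def open_happybase_table(host="thrift-host", name="my_table"):
    # Assumption: an HBase Thrift server is reachable at `host`.
    # Both default values are placeholders; connection cleanup is omitted.
    import happybase  # imported lazily, so it is only needed on executors
    return happybase.Connection(host).table(name)


def scan_in_parallel(sc, users, ts_start, ts_end, num_partitions=200):
    """Spread the users across partitions; each partition scans its share."""
    scan = partial(scan_users, ts_start=ts_start, ts_end=ts_end,
                   open_table=open_happybase_table)
    return sc.parallelize(users, num_partitions).mapPartitions(scan)
```

With an existing SparkContext, scan_in_parallel(sc, users, timestamp_start, timestamp_end) would return an RDD of (row key, column dict) pairs. Opening the connection once per partition instead of once per user keeps the connection count bounded by the number of partitions.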

Parallelizing an HBase scan with PySpark

0 answers: