I am using Cloudera 5.16 with Hadoop 2.6.
I use ImportTsv to load large CSV files into HBase.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=';' -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age mynamespace:mytable /path/to/csv/dir/*.csv
My problem is that the operation takes 20 to 30 seconds regardless of file size (some of my files have 300k rows, others 1000k rows).
19/08/22 15:11:56 INFO mapreduce.Job: Job job_1566288518023_0335 running in uber mode : false
19/08/22 15:11:56 INFO mapreduce.Job: map 0% reduce 0%
19/08/22 15:12:06 INFO mapreduce.Job: map 67% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: map 100% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: Job job_1566288518023_0335 completed successfully
19/08/22 15:12:08 INFO mapreduce.Job: Counters: 34
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=801303
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2709617
HDFS: Number of bytes written=0
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=3
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=25662
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=25662
Total vcore-milliseconds taken by all map tasks=25662
Total megabyte-milliseconds taken by all map tasks=26277888
Map-Reduce Framework
Map input records=37635
Map output records=37635
Input split bytes=531
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=454
CPU time spent (ms)=14840
Physical memory (bytes) snapshot=1287696384
Virtual memory (bytes) snapshot=8280121344
Total committed heap usage (bytes)=2418540544
Peak Map Physical memory (bytes)=439844864
Peak Map Virtual memory (bytes)=2776657920
ImportTsv
Bad Lines=0
File Input Format Counters
Bytes Read=2709086
File Output Format Counters
Bytes Written=0
I have already pre-split the table into multiple regions based on the key to distribute the puts, but nothing changed.
create 'mynamespace:mytable', {NAME => 'data', COMPRESSION => 'SNAPPY'}, {SPLITS => ['0','1','2','3','4','5']}
Does anyone know how to optimize this operation?
Thanks.
Answer 0 (score: 0)
I think there are a few things you can do to improve this:
I would suggest setting the number of regions for the table by adding:
NUMREGIONS => "some reasonable number depending on the size of the initial table"
By "initial table" I mean one sized for the amount of data you are going to load into it up front. It does not necessarily have to accommodate data that will be added gradually later (you don't want to end up running a lot of half-empty regions).
SPLITALGO => 'UniformSplit'
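Putting both suggestions together, a pre-split create statement might look like the following sketch (the table and column-family names are taken from the question; the region count of 15 is an illustrative assumption, not a recommendation for your data volume):

```shell
# In the HBase shell: pre-split into a fixed number of regions,
# letting UniformSplit compute evenly spaced split points,
# instead of listing SPLITS by hand.
create 'mynamespace:mytable',
  {NAME => 'data', COMPRESSION => 'SNAPPY'},
  {NUMREGIONS => 15, SPLITALGO => 'UniformSplit'}
```

UniformSplit spreads split points evenly across the whole byte range of the row key, which suits keys with a roughly uniform distribution; for printable hex-prefixed keys, HexStringSplit is the usual alternative.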
I would also suggest googling the options mentioned above.
I don't really know your specific use case, so I can't give you a deeper answer, but I believe these changes will help improve the performance of importing data into your table.