In a Sqoop import, I want to control how the imported data is split across files for a given number of mappers

Asked: 2015-07-03 11:04:04

Tags: java hadoop sqoop

mysql> select * from employee;

| empno | empname      | salary |
=================================
|   101 | Ram          |   5000 |
|   102 | Hari         |   7000 |
|   104 | Vamshi       |   7000 |
|   103 | Revathy      |   7000 |
|   105 | Jaya         |   9000 |
|   106 | Suresh       |   8000 |
|   107 | Ramesh       |   9000 |
|   108 | Prasana      |  10000 |
|   109 | Ramsamy      |  20000 |
|   110 | Singaram     |  30000 |
|   200 | ramanathan   |  30000 |
|   201 | Victor       |  33000 |
|   202 | Naveen       |  33000 |
|   203 | Karthik      |  33000 |
|   204 | Karthikeyan  |  33000 |
|   205 | Somasundaram |  43000 |
|   301 | Test1        |  50000 |
|   302 | Test2        |  60000 |
|   303 | Test3        |  70000 |

Command in Sqoop

sqoop import --connect jdbc:mysql://<hostname>/test \
  --username <username> --password <password> \
  --table employee \
  --direct --verbose \
  --split-by salary

With the above command, Sqoop queries min(salary) and max(salary) and writes the data to HDFS as 10 records in the first file, 3 records in the second file, 3 records in the third file, and 3 records in the last file.

15/07/03 17:32:37 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`salary`), MAX(`salary`) FROM employee
15/07/03 17:32:37 DEBUG db.IntegerSplitter: Splits: [5,000 to 70,000] into 4 parts
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 5,000
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 21,250
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 37,500
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 53,750
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 70,000
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 5000' and upper bound '`salary` < 21250'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 21250' and upper bound '`salary` < 37500'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 37500' and upper bound '`salary` < 53750'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 53750' and upper bound '`salary` <= 70000'
15/07/03 17:32:37 INFO mapreduce.JobSubmitter: number of splits:4

I would like to know how it decides the number of records that go into each file. Is this customizable?

1 Answer:

Answer 0 (score: 0)

The salary ranges from 5000 to 70000 (i.e. min 5000, max 70000). That range is divided into 4 equal-width intervals, one per mapper:

(70000 - 5000) / 4 = 16250

Therefore,

split 1: from 5000 to 21250 (= 5000 + 16250)
split 2: from 21250 to 37500 (= 21250 + 16250)
split 3: from 37500 to 53750 (= 37500 + 16250)
split 4: from 53750 to 70000 (= 53750 + 16250)
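
The arithmetic above can be reproduced with a short Python sketch. It is only an illustration of the equal-width division shown in the debug log, not Sqoop's actual IntegerSplitter source. The function names `split_points` and `split_conditions` are made up for this example.

```python
def split_points(min_val, max_val, num_splits):
    """Return num_splits + 1 evenly spaced boundaries from min_val to max_val.

    Hypothetical helper: approximates the equal-width division seen in the log.
    """
    step = (max_val - min_val) / num_splits
    points = [round(min_val + i * step) for i in range(num_splits)]
    points.append(max_val)  # the last boundary is always the maximum itself
    return points


def split_conditions(column, points):
    """Build one WHERE-style range per split.

    Every range is half-open (>= lower, < upper) except the last one,
    which includes the maximum value (<= upper), as in the log output.
    """
    conditions = []
    last = len(points) - 2
    for i in range(len(points) - 1):
        upper_op = "<=" if i == last else "<"
        conditions.append(
            f"{column} >= {points[i]} AND {column} {upper_op} {points[i + 1]}"
        )
    return conditions


points = split_points(5000, 70000, 4)
conditions = split_conditions("salary", points)
for condition in conditions:
    print(condition)
```

Because the intervals have equal width in salary values rather than equal row counts, the files only come out evenly sized when the split column is evenly distributed. The number of intervals follows the number of mappers (4 by default, changed with `-m` / `--num-mappers`), and Sqoop's `--boundary-query` option replaces the default MIN/MAX query, so both are places where the split can be adjusted.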