Is it possible to define column fields by character position in Pig?

Asked: 2015-06-24 04:43:46

Tags: hadoop hive apache-pig hiveql

Suppose I have the following structured data file:

1298712012061228765236542123049824234209374
1203972012073042198531203948203498023498023
1203712012092329385612350924395798456892345
1234812012101223423498230482034893204820398

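In the file above, the first 6 digits are the UserId (positions 1-6), the next 8 digits are year_date (positions 7-12), the next 6 are the Count field (positions 13-18), followed by product_id (positions 19-30) and Character_values (positions 31-42). I want to load my data split into the fields mentioned above. Is there an option for this in Pig or Hive?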

2 Answers:

Answer 0: (score: 3)

You can do this in both Pig and Hive. Both solutions are below.

PIG:

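-- Load each record as a single chararray field.
data = LOAD '/data.txt' USING PigStorage() AS (line:chararray);
-- SUBSTRING(string, startIndex, stopIndex): startIndex is 0-based, stopIndex is exclusive.
strsplit = FOREACH data GENERATE
    SUBSTRING(line, 0, 6) AS UserID,
    SUBSTRING(line, 6, 12) AS year_date,
    SUBSTRING(line, 12, 18) AS Count,
    SUBSTRING(line, 18, 30) AS product_id,
    SUBSTRING(line, 30, 42) AS Character_values;

Output when dumped:

dump strsplit;
(129871,201206,122876,523654212304,982423420937)
(120397,201207,304219,853120394820,349802349802)
(120371,201209,232938,561235092439,579845689234)
(123481,201210,122342,349823048203,489320482039)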
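
Pig's SUBSTRING takes a zero-based start index and an exclusive stop index, while the question gives positions that are 1-based and inclusive. The conversion is easy to get wrong by one, so here is a minimal Python sketch of it, checked against the first sample record. The names `LAYOUT`, `to_zero_based` and `split_fixed_width` are made up for this illustration and are not part of Pig.

```python
# Field layout from the question: name -> (first, last), 1-based and inclusive.
LAYOUT = {
    "UserID": (1, 6),
    "year_date": (7, 12),
    "Count": (13, 18),
    "product_id": (19, 30),
    "Character_values": (31, 42),
}


def to_zero_based(first, last):
    """Convert a 1-based inclusive range to a 0-based half-open (start, stop) pair."""
    return first - 1, last


def split_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of field name -> substring."""
    fields = {}
    for name, (first, last) in layout.items():
        start, stop = to_zero_based(first, last)
        # Python slices use the same convention as Pig's SUBSTRING:
        # start is 0-based and stop is exclusive.
        fields[name] = line[start:stop]
    return fields


record = "1298712012061228765236542123049824234209374"
print(split_fixed_width(record))
```

Because Python slicing follows the same start/stop convention, the `(start, stop)` pairs produced here are the values that would go into the SUBSTRING calls.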

HIVE:

Step 1: Create a temporary table and load the raw data.

create table temp(line String)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n';
LOAD DATA INPATH '/data.txt' INTO TABLE temp;  

Step 2: Create a table that fits your data.

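-- USER is a reserved keyword in Hive 1.2.0 and later; if this fails to parse there,
-- quote the name with backticks or rename the table.
create table user(UserID String,year_date String,Count String,product_id String,Character_values String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';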

Step 3: Insert the data from the temporary table into the actual table.

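-- Hive's substr(str, pos, len) takes a 1-based start position and a length, not an end position.
INSERT INTO TABLE user
SELECT substr(line,1,6), substr(line,7,6), substr(line,13,6), substr(line,19,12), substr(line,31,12) FROM temp;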
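
Hive's substr takes a 1-based start position and a length, so each "first to last" range from the question has to be turned into a (position, length) pair. As a rough sketch, the Python below does that conversion and builds the SELECT text from the layout. The helper names `hive_substr_args` and `build_select` are hypothetical, and the `line` column and `temp` table are taken from the steps above.

```python
# Field ranges from the question, 1-based and inclusive, in column order.
RANGES = [(1, 6), (7, 12), (13, 18), (19, 30), (31, 42)]


def hive_substr_args(first, last):
    """Convert a 1-based inclusive range to the (pos, len) pair used by substr(str, pos, len)."""
    return first, last - first + 1


def build_select(ranges, column="line", table="temp"):
    """Build a SELECT statement with one substr() call per field range."""
    exprs = []
    for first, last in ranges:
        pos, length = hive_substr_args(first, last)
        exprs.append(f"substr({column}, {pos}, {length})")
    return "SELECT " + ", ".join(exprs) + " FROM " + table


print(build_select(RANGES))
```

Generating the expressions from one list keeps the positions in a single place, which helps if the record layout changes later.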

Answer 1: (score: 2)

Could you use SUBSTRING?

-- SUBSTRING uses a 0-based start index and an exclusive stop index.
A = LOAD 'DATA' USING PigStorage() AS (line:chararray);
B = FOREACH A GENERATE SUBSTRING(line,0,6) AS UserID, SUBSTRING(line,6,12) AS Year_date ...