Split a string on spaces in Hive

Asked: 2015-08-07 15:04:24

Tags: string split hive apache-pig

This is the format of my CSV file:

Chevrolet C10,13.0,8,350.0,145.0,4055,12.0,76,US
Ford F108,13.0,8,302.0,130.0,3870,15.0,76,US
Dodge D100,13.0,8,318.0,150.0,3755,14.0,76,US
Honda Accord CVCC,31.5,4,98.00,68.00,2045,18.5,77,Japan
Buick Opel Isuzu Deluxe,30.0,4,111.0,80.00,2155,14.8,77,US
Renault 5 GTL,36.0,4,79.00,58.00,1825,18.6,77,Europe
Plymouth Arrow GS,25.5,4,122.0,96.00,2300,15.5,77,US

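I want to split the first field and keep only its first word: Chevrolet C10 should become Chevrolet, Ford F108 should become Ford, Honda Accord CVCC should become Honda, and so on. I will then use the brand name for further processing.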

4 answers:

Answer 0 (score: 1)

Pig solution

Code:

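-- Load the CSV; the schema lists all nine columns found in the sample rows.
read = LOAD 'test.data' USING PigStorage(',') AS (name:chararray, val1:double, val2:int, val3:double, val4:double, val5:int, val6:double, val7:int, country:chararray);
-- Keep the part of name that precedes the first space.
sub_data = FOREACH read GENERATE SUBSTRING(name,0,(INDEXOF(name, ' ',0)))  AS (subname:chararray);
DUMP sub_data;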

Output:

(Chevrolet)
(Ford)
(Dodge)
(Honda)
(Buick)
(Renault)
(Plymouth)

Answer 1 (score: 0)

select
  case when MODEL like 'US % %' or MODEL like 'Europe % %'
        then regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)
        when MODEL like '% %'
        then regexp_extract(MODEL, '^([^ ]*) ', 1)
        else MODEL
  end as BRAND
from WHATEVER

Example results:

  • Chevrolet C10 => Chevrolet
  • US Honda Accord => US Honda
  • Zorglub => Zorglub
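
The two patterns in the query above can be checked against literals before running them on a real table. This is only a sketch: it assumes a Hive version that accepts SELECT without a FROM clause (0.13 or later), and the input strings are the examples listed in this answer.

```sql
-- One-word prefix: capture everything before the first space.
SELECT regexp_extract('Chevrolet C10', '^([^ ]*) ', 1);          -- Chevrolet

-- Two-word prefix: the branch the CASE takes when MODEL starts with 'US' or 'Europe'.
SELECT regexp_extract('US Honda Accord', '^([^ ]* [^ ]*) ', 1);  -- US Honda
```

A value such as Zorglub contains no space, so neither LIKE test matches and the ELSE branch returns it unchanged.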

Answer 2 (score: 0)

Use the following UDF:

substring_index(string A, string delim, int count)
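
A minimal sketch of how this function could be applied to the question's data. It assumes Hive 1.3.0 or later, where substring_index is available as a built-in, and it borrows the carinfo table and carname column from another answer on this page; the alias brand is chosen here for illustration.

```sql
-- With count = 1, the function returns everything before the first delimiter.
SELECT substring_index('Honda Accord CVCC', ' ', 1);  -- Honda

-- The same call applied to a table column (carinfo and carname are assumed names).
SELECT substring_index(carname, ' ', 1) AS brand
FROM carinfo;
```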

Answer 3 (score: 0)

Create a table whose schema matches the CSV file.

CREATE TABLE carinfo (carname STRING, val1 DOUBLE, val2 INT, val3 DOUBLE, val4 DOUBLE, val5 INT, val6 DOUBLE, val7 INT, country STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

Load the data into the table above:

LOAD DATA LOCAL INPATH '/hivesamples/splitstr.txt' OVERWRITE INTO TABLE carinfo;

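Use CTAS (CREATE TABLE AS SELECT) to split the car name and keep only the brand. The new table has the same column names and types as the table defined earlier.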

CREATE TABLE modified_carinfo 
AS 
SELECT split(carname, ' ')[0] as carname, val1, val2, val3, val4, val5 ,val6, val7, country 
FROM carinfo;
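
Because split takes a regular expression as its second argument, the same query can be made tolerant of stray whitespace. The variant below is a sketch under that assumption (leading spaces or repeated spaces in carname); the sample rows in the question do not actually need it, and the alias brand is chosen here for illustration.

```sql
-- Element 0 of the array returned by split is the first word.
SELECT split('Buick Opel Isuzu Deluxe', ' ')[0];  -- Buick

-- trim removes leading and trailing spaces; '\\s+' splits on any run of whitespace.
SELECT split(trim(carname), '\\s+')[0] AS brand
FROM carinfo;
```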