This is the format of my CSV file:
Chevrolet C10,13.0,8,350.0,145.0,4055,12.0,76,US
Ford F108,13.0,8,302.0,130.0,3870,15.0,76,US
Dodge D100,13.0,8,318.0,150.0,3755,14.0,76,US
Honda Accord CVCC,31.5,4,98.00,68.00,2045,18.5,77,Japan
Buick Opel Isuzu Deluxe,30.0,4,111.0,80.00,2155,14.8,77,US
Renault 5 GTL,36.0,4,79.00,58.00,1825,18.6,77,Europe
Plymouth Arrow GS,25.5,4,122.0,96.00,2300,15.5,77,US
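I want to split the first field so that only the brand remains: Chevrolet C10 should become Chevrolet, Ford F108 should become Ford, Honda Accord CVCC should become Honda, and so on. I will then use that car name for further processing.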
Answer 0 (score: 1)
A Pig solution
Code:
read = LOAD 'test.data' USING PigStorage(',') AS (name:chararray, val1:double, val2:int, val3:double, val4:double, val5:int, val6:double, val7:int, country:chararray);
sub_data = FOREACH read GENERATE SUBSTRING(name,0,(INDEXOF(name, ' ',0))) AS (subname:chararray);
DUMP sub_data;
Output:
(Chevrolet)
(Ford)
(Dodge)
(Honda)
(Buick)
(Renault)
(Plymouth)
Answer 1 (score: 0)
select
case when MODEL like 'US % %' or MODEL like 'Europe % %'
then regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)
when MODEL like '% %'
then regexp_extract(MODEL, '^([^ ]*) ', 1)
else MODEL
end as BRAND
from WHATEVER
Answer 2 (score: 0)
Answer 3 (score: 0)
Create a table with the schema that the data requires.
CREATE TABLE carinfo (carname STRING, val1 DOUBLE, val2 INT, val3 DOUBLE, val4 DOUBLE, val5 INT, val6 DOUBLE, val7 INT, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Load the data into the table above:
LOAD DATA LOCAL INPATH '/hivesamples/splitstr.txt' OVERWRITE INTO TABLE carinfo;
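Use CTAS (CREATE TABLE AS SELECT) to split the car name and keep only the brand. The new table will have the same schema you defined earlier.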
CREATE TABLE modified_carinfo
AS
SELECT split(carname, ' ')[0] as carname, val1, val2, val3, val4, val5 ,val6, val7, country
FROM carinfo;
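As a quick sanity check outside Hive or Pig, the same first-token rule can be sketched in Python against the sample rows from the question. This is only an illustration: `brand_of` and `brands` are hypothetical helper names, and the sketch assumes the brand is always the text before the first space, so a multi-word brand would be cut short.

```python
import csv
import io

# Sample rows copied from the question.
SAMPLE = """\
Chevrolet C10,13.0,8,350.0,145.0,4055,12.0,76,US
Ford F108,13.0,8,302.0,130.0,3870,15.0,76,US
Dodge D100,13.0,8,318.0,150.0,3755,14.0,76,US
Honda Accord CVCC,31.5,4,98.00,68.00,2045,18.5,77,Japan
Buick Opel Isuzu Deluxe,30.0,4,111.0,80.00,2155,14.8,77,US
Renault 5 GTL,36.0,4,79.00,58.00,1825,18.6,77,Europe
Plymouth Arrow GS,25.5,4,122.0,96.00,2300,15.5,77,US
"""


def brand_of(car_name):
    """Return the text before the first space, or the whole name if there is no space."""
    return car_name.split(" ", 1)[0]


def brands(csv_text):
    """Parse the CSV text and return the brand taken from the first field of each row."""
    reader = csv.reader(io.StringIO(csv_text))
    return [brand_of(row[0]) for row in reader if row]


if __name__ == "__main__":
    for brand in brands(SAMPLE):
        print(brand)
```

Comparing this list with the `carname` column of `modified_carinfo` is one way to confirm the split behaves as intended on the sample data.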