Removing double quotes (") from a CSV file using Pig

Time: 2015-07-14 14:24:20

Tags: hadoop apache-pig

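I am trying to remove double quotes (") from a file. Some fields contain data such as "Newyork,NY". How can I do this? I tried to strip the (") characters from the CSV, but it does not work. The steps and code are given below: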

I am using pig -x local.

Start Pig.

Step 1:

test4 = LOAD '/home/hduser/Desktop/flight_data.csv'
        using PigStorage(',') AS (
YEAR: chararray,
QUARTER: chararray,
MONTH: chararray,
DAY_OF_MONTH: chararray,
DAY_OF_WEEK: chararray,
FL_DATE: chararray,
UNIQUE_CARRIER: chararray,
AIRLINE_ID: chararray,
CARRIER: chararray,
TAIL_NUM: chararray,
FL_NUM: chararray,
ORIGIN: chararray,
ORIGIN_CITY_NAME: chararray,
ORIGIN_STATE_ABR: chararray,
ORIGIN_STATE_FIPS: chararray,
ORIGIN_STATE_NM: chararray,
ORIGIN_WAC: chararray,
DEST: chararray,
DEST_CITY_NAME: chararray,
DEST_STATE_ABR: chararray,
DEST_STATE_FIPS: chararray,
DEST_STATE_NM: chararray,
DEST_WAC: chararray,
CRS_DEP_TIME: chararray,
DEP_TIME: chararray,
DEP_DELAY: chararray,
DEP_DELAY_NEW: chararray,
DEP_DEL15: chararray,
DEP_DELAY_GROUP: chararray,
DEP_TIME_BLK: chararray,
TAXI_OUT: chararray,
WHEELS_OFF: chararray,
WHEELS_ON: chararray,
TAXI_IN: chararray,
CRS_ARR_TIME: chararray,
ARR_TIME: chararray,
ARR_DELAY: chararray,
ARR_DELAY_NEW: chararray,
ARR_DEL15: chararray,
ARR_DELAY_GROUP: chararray,
ARR_TIME_BLK: chararray,
CANCELLED: chararray,
CANCELLATION_CODE: chararray,
DIVERTED: chararray,
CRS_ELAPSED_TIME: chararray,
ACTUAL_ELAPSED_TIME: chararray,
AIR_TIME: chararray,
FLIGHTS: chararray,
DISTANCE: chararray,
DISTANCE_GROUP: chararray,
CARRIER_DELAY: chararray,
WEATHER_DELAY: chararray,
NAS_DELAY: chararray,
SECURITY_DELAY: chararray,
LATE_AIRCRAFT_DELAY: chararray); 

Step 2:

new_data = foreach test4 generate
FLATTEN(REGEX_EXTRACT(ORIGIN_CITY_NAME,'."([^"])"',1)) AS StateName;

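After running this command, the field in new_data is stored as (). Please suggest some ways to solve this. Thanks for your help.

I also tried another approach, shown below: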

aviation_data = foreach test4 generate
REGEX_EXTRACT($0,'([0-9]+)', 1),
REGEX_EXTRACT($1,'([0-9]+)', 1),
REGEX_EXTRACT($2,'([0-9]+)', 1),
REGEX_EXTRACT($3,'([0-9]+)', 1),
REGEX_EXTRACT($4,'([0-9]+)', 1),
REGEX_EXTRACT($5,'([0-9]+)', 1),
REGEX_EXTRACT($6,'([0-9]+)', 1),
REGEX_EXTRACT($7,'([0-9]+)', 1),
REGEX_EXTRACT($8,'([0-9]+)', 1),
REGEX_EXTRACT($9,'([0-9]+)', 1),
REGEX_EXTRACT($10,'([0-9]+)', 1),
REGEX_EXTRACT($11,'([0-9]+)', 1),
REGEX_EXTRACT($12,'([0-9]+)', 1),
REGEX_EXTRACT($13,'([0-9]+)', 1),
REGEX_EXTRACT($14,'([0-9]+)', 1),
REGEX_EXTRACT($15,'([0-9]+)', 1),
REGEX_EXTRACT($16,'([0-9]+)', 1),
REGEX_EXTRACT($17,'([0-9]+)', 1),
REGEX_EXTRACT($18,'([0-9]+)', 1),
REGEX_EXTRACT($19,'([0-9]+)', 1),
REGEX_EXTRACT($20,'([0-9]+)', 1),
REGEX_EXTRACT($21,'([0-9]+)', 1),
REGEX_EXTRACT($22,'([0-9]+)', 1),
REGEX_EXTRACT($23,'([0-9]+)', 1),
REGEX_EXTRACT($24,'([0-9]+)', 1),
REGEX_EXTRACT($25,'([0-9]+)', 1),
REGEX_EXTRACT($26,'([0-9]+)', 1),
REGEX_EXTRACT($27,'([0-9]+)', 1),
REGEX_EXTRACT($28,'([0-9]+)', 1),
REGEX_EXTRACT($29,'([0-9]+)', 1),
REGEX_EXTRACT($30,'([0-9]+)', 1),
REGEX_EXTRACT($31,'([0-9]+)', 1),
REGEX_EXTRACT($32,'([0-9]+)', 1),
REGEX_EXTRACT($33,'([0-9]+)', 1),
REGEX_EXTRACT($34,'([0-9]+)', 1),
REGEX_EXTRACT($35,'([0-9]+)', 1),
REGEX_EXTRACT($36,'([0-9]+)', 1),
REGEX_EXTRACT($37,'([0-9]+)', 1),
REGEX_EXTRACT($38,'([0-9]+)', 1),
REGEX_EXTRACT($39,'([0-9]+)', 1),
REGEX_EXTRACT($40,'([0-9]+)', 1),
REGEX_EXTRACT($41,'([0-9]+)', 1),
REGEX_EXTRACT($42,'([0-9]+)', 1),
REGEX_EXTRACT($43,'([0-9]+)', 1),
REGEX_EXTRACT($44,'([0-9]+)', 1),
REGEX_EXTRACT($45,'([0-9]+)', 1),
REGEX_EXTRACT($46,'([0-9]+)', 1),
REGEX_EXTRACT($47,'([0-9]+)', 1),
REGEX_EXTRACT($48,'([0-9]+)', 1),
REGEX_EXTRACT($49,'([0-9]+)', 1),
REGEX_EXTRACT($50,'([0-9]+)', 1),
REGEX_EXTRACT($51,'([0-9]+)', 1),
REGEX_EXTRACT($52,'([0-9]+)', 1),
REGEX_EXTRACT($53,'([0-9]+)', 1),
REGEX_EXTRACT($54,'([0-9]+)', 1);

The result is as follows:

(2015,1,1,29,4,2015,,20304,,549,4837,,,,,04,,81,,,,,53,,93,1757,1851,54,54,1,3,1700,19,1910,2034,6,2005,2040,35,35,1,2,2000,0,,0,188,169,144,1,1107,5,0,0,0)

None of the text fields come through.

1 Answer:

Answer 0: (score: 1)

We can use org.apache.pig.piggybank.storage.CSVExcelStorage() or org.apache.pig.piggybank.storage.CSVLoader().

For details, see the following API links:

http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
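
A CSV-aware loader is needed here because PigStorage(',') splits on every comma, including a comma inside a quoted value, so a field like "Newyork,NY" probably arrives as two pieces, each with one unmatched quote, and every later column shifts by one. The sketch below is one way to check this on the question's file. It assumes the city name starts at position $12 (its place in the question's schema) and that no earlier field contains a comma. The alias names are only illustrative.

```pig
-- Load without a schema so every comma-separated piece is visible by position.
raw = LOAD '/home/hduser/Desktop/flight_data.csv' USING PigStorage(',');

-- Assumption: $12 and $13 hold the two halves of the quoted ORIGIN_CITY_NAME value,
-- because PigStorage cut it at the comma inside the quotes.
halves = FOREACH raw GENERATE
    (chararray)$12 AS left_half,
    (chararray)$13 AS right_half;

-- Look at a few rows only.
first_rows = LIMIT halves 5;
DUMP first_rows;
```

If the dumped pairs each show a single leading or trailing quote, that would explain why a regex requiring both quotes in ORIGIN_CITY_NAME matches nothing.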

test4 = LOAD '/home/hduser/Desktop/flight_data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (....);
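
A slightly fuller sketch of the same load, under a few assumptions: the piggybank.jar path is hypothetical and must be replaced with the one from your own Pig installation, the schema is left out so fields are addressed by position, and the positions $12 and $18 are taken from the field order in the question.

```pig
-- Hypothetical path: use the location of piggybank.jar in your own Pig installation.
REGISTER /usr/lib/pig/piggybank.jar;

-- CSVExcelStorage keeps commas inside double-quoted values and drops the enclosing quotes.
flights = LOAD '/home/hduser/Desktop/flight_data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage();

-- Positions follow the question's field order: $12 is ORIGIN_CITY_NAME, $18 is DEST_CITY_NAME.
cities = FOREACH flights GENERATE
    (chararray)$12 AS ORIGIN_CITY_NAME,
    (chararray)$18 AS DEST_CITY_NAME;

-- Look at a few rows only.
first_rows = LIMIT cities 5;
DUMP first_rows;
```

With this loader the city values should come back whole and unquoted, so no REGEX_EXTRACT step is needed afterwards.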