Loading a large amount of CSV data into Pig

Time: 2016-03-26 21:45:27

Tags: csv apache-pig

I am using this query in Pig to load data from a CSV file that contains 50000 records.

Q2 = LOAD '/home/user/q2.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') as (Id:chararray,
PostTypeId:chararray, 
AcceptedAnswerId:chararray, 
ParentId:chararray, 
CreationDate:chararray, 
DeletionDate:chararray, 
Score:chararray, 
ViewCount:chararray, 
Body:chararray, 
OwnerUserId:chararray, 
OwnerDisplayName:chararray, 
LastEditorUserId:chararray, 
LastEditorDisplayName:chararray, 
LastEditDate:chararray, 
LastActivityDate:chararray, 
Title:chararray, 
Tags:chararray, 
AnswerCount:chararray, 
CommentCount:chararray, 
FavoriteCount:chararray, 
ClosedDate:chararray, 
CommunityOwnedDate:chararray);

This is the query that cleans the data, removing \n and , (among other things) from the Body field and others.

Q2Clean = FOREACH Q2 GENERATE
Id as Id, 
PostTypeId as PostTypeId, 
AcceptedAnswerId as AcceptedAnswerId, 
(chararray)REPLACE(ParentId,'"','')  as ParentId, 
CreationDate as CreationDate, 
(chararray)REPLACE(DeletionDate,'"','') as DeletionDate, 
Score as Score, 
ViewCount as ViewCount,  
(chararray)REPLACE(REPLACE(Body,'\n',''),',','')as Body, 
OwnerUserId as OwnerUserId, 
(chararray)REPLACE(OwnerDisplayName,'"','') as OwnerDisplayName, 
LastEditorUserId as LastEditorUserId, 
(chararray)REPLACE(LastEditorDisplayName,'"','') as LastEditorDisplayName, 
LastEditDate as LastEditDate, 
LastActivityDate as LastActivityDate, 
(chararray)REPLACE(Title,',','') as Title, 
(chararray)REPLACE(Tags,',','') as Tags, 
AnswerCount as AnswerCount, 
CommentCount as CommentCount, 
FavoriteCount as FavoriteCount, 
(chararray)REPLACE(ClosedDate,'"','') as ClosedDate, 
(chararray)REPLACE(CommunityOwnedDate,'"','') as CommunityOwnedDate;

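The problem is that when I store the output, it reports 617538 rows and creates two files. The first file has 27000 records, which are formatted correctly, but the second file is not stored correctly: it contains about 610000 lines, many of which contain just ... How can I load the data correctly so that the output shows 50000 rows instead of 617538?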

1 answer:

Answer 0: (score: 0)

I think the problem is in the following line of the script.

(chararray)REPLACE(REPLACE(Body,'\n',''),',','')as Body, 

You have to add another backslash in order to replace '\n':

(chararray)REPLACE(REPLACE(Body,'\\n',''),',','')as Body,
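
One way to check whether this change has the intended effect is to count the records after cleaning, before storing anything. The following is only a minimal sketch: it assumes Q2 is the relation that the FOREACH in the question reads from, it keeps just two of the 22 fields for brevity, and the aliases Q2Check, Q2Grouped, and Q2Count are names made up for this illustration. It has not been run against the original data, so it shows how to verify the row count rather than guaranteeing that the extra backslash resolves it.

```pig
-- Sketch only: Q2 is assumed to be the relation loaded with CSVExcelStorage.
-- Only Id and Body are kept here; the real script generates all 22 fields.
-- '\\n' is the escaped form suggested in the answer above.
Q2Check = FOREACH Q2 GENERATE
    Id AS Id,
    (chararray)REPLACE(REPLACE(Body, '\\n', ''), ',', '') AS Body;

-- Count the cleaned records so the total can be compared with the expected 50000.
Q2Grouped = GROUP Q2Check ALL;
Q2Count = FOREACH Q2Grouped GENERATE COUNT_STAR(Q2Check);
DUMP Q2Count;
```

If the count printed here is already 50000 but the stored files still contain far more lines, the extra lines are probably being introduced when the output is written (for example, by line breaks that remain inside a field) rather than when the CSV is loaded.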