How do I import/load a .csv file in Pig?

Time: 2014-09-01 03:55:35

Tags: hadoop apache-pig bigdata hadoop-streaming

Suppose I have a tab-delimited text file (datetemp.txt) that I want to load into Pig for processing. When I enter the lines below, I get the following error:

grunt> inputfile = load '/training/pig/datetemp.txt' using PigStorage() As (EventID:chararray,eventdate:chararray,count:int);

grunt> dump inputfile;

2014-09-06 08:41:23,527 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-09-06 08:41:23,544 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-09-06 08:41:23,548 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-09-06 08:41:23,548 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-09-06 08:41:23,551 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-09-06 08:41:23,551 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-09-06 08:41:23,552 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2739171785773930333.jar
2014-09-06 08:42:39,608 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2739171785773930333.jar created
2014-09-06 08:42:39,612 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-09-06 08:42:39,619 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-09-06 08:42:39,630 [Thread-12] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-09-06 08:42:39,891 [Thread-12] INFO org.apache.hadoop.mapred.JobClient - Cleaning up the staging area hdfs://192.168.195.130:8020/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/training/.staging/job_201408292336_0009
2014-09-06 08:42:39,891 [Thread-12] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:training (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
2014-09-06 08:42:40,119 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-09-06 08:42:40,125 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job null has failed! Stop running all dependent jobs
2014-09-06 08:42:40,125 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-09-06 08:42:40,131 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:285)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1014)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1031)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:943)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:318)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.startReadyJobs(JobControl.java:238)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:269)
    at java.lang.Thread.run(Thread.java:662)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:260)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:273)
    ... 15 more

2014-09-06 08:42:40,131 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2014-09-06 08:42:40,135 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.1.1 0.10.0-cdh4.1.1 training 2014-09-06 08:41:23 2014-09-06 08:42:40 UNKNOWN

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
N/A inputfile MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:285)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1014)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1031)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:943)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:318)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.startReadyJobs(JobControl.java:238)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:269)
    at java.lang.Thread.run(Thread.java:662)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:260)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:273)
    ... 15 more
hdfs://192.168.195.130:8020/tmp/temp-1004538676/tmp1582688785,

Input(s):
Failed to read data from "/training/pig/datetemp.txt"

Output(s):
Failed to produce result in "hdfs://192.168.195.130:8020/tmp/temp-1004538676/tmp1582688785"

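Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
null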

2014-09-06 08:42:40,135 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-09-06 08:42:40,142 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias inputfile
Details at logfile: /home/training/pig_1410006833865.log

Please help me with this!

7 answers:

Answer 0 (score: 2)

PigStorage is case-sensitive. Use PigStorage, not pigstorage.

Answer 1 (score: 0)

Your question's title says you are trying to load a CSV file. For that, I have had good luck with using org.apache.pig.piggybank.storage.CSVExcelStorage() in the LOAD statement, as shown at https://martin.atlassian.net/wiki/x/WYBmAQ.
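To see why a CSV-aware loader can matter, here is a small Python sketch (an illustration only, not Pig code). It contrasts a naive split on every comma, which is roughly how a plain single-character delimiter treats a line, with a parser that honors double-quoted fields. The sample row and the function names are hypothetical and are not taken from datetemp.txt.

```python
import csv
import io

# Hypothetical sample row: the date field contains a comma inside quotes.
SAMPLE_ROW = 'E1,"Sep 1, 2014",5'


def naive_split(line, delimiter=","):
    """Split on every delimiter character, ignoring any quoting."""
    return line.split(delimiter)


def quoted_split(line, delimiter=","):
    """Split using Python's csv module, which honors double-quoted fields."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))
```

With this sample row, naive_split yields four pieces because the quoted comma is treated as a separator, while quoted_split yields the three intended fields. If the data never contains quoted delimiters, the simpler approach is usually enough.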

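Answer 2 (score: 0)

Since you mentioned that the file is tab-delimited, why not write PigStorage('\t') instead of PigStorage()?

The code you posted:

grunt> inputfile = load '/training/pig/datetemp.txt' using PigStorage() As (EventID:chararray,eventdate:chararray,count:int);

This might solve your problem.

Let me know if the cause turns out to be something else.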
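For context, PigStorage() with no argument already splits on tabs, so the delimiter only matters when the file is not actually tab-separated. The Python sketch below (an illustration, not Pig's implementation) mimics the general effect of applying a three-field schema with the wrong delimiter: the whole line lands in the first field and the remaining fields come out empty. The sample lines and the helper name load_line are hypothetical.

```python
# Hypothetical sample lines with the same three values, differently delimited.
TAB_LINE = "E1\t2014-09-01\t5"
COMMA_LINE = "E1,2014-09-01,5"


def load_line(line, delimiter="\t", num_fields=3):
    """Split a line on the delimiter and pad missing fields with None."""
    fields = line.rstrip("\n").split(delimiter)
    # Pad short rows so every result has exactly num_fields entries.
    fields += [None] * (num_fields - len(fields))
    return tuple(fields[:num_fields])
```

Calling load_line(COMMA_LINE) with the default tab delimiter returns the full line followed by two None values, whereas load_line(COMMA_LINE, delimiter=",") returns the three separate fields. Note that the error in the question (ERROR 2118) reports a missing input path, which the choice of delimiter does not affect.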

Answer 3 (score: 0)

hdfs://192.168.195.130:8020/training/pig/datetemp.txt

This file was not found in your HDFS! Make sure the input file is placed at the location above.

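Answer 4 (score: 0)

Have you checked whether the input path exists?

Try this in the Grunt shell:

grunt> fs -ls /training/pig/

If datetemp.txt appears in the listing, the load will work; otherwise, supply the correct input path.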
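The same check can be scripted. The sketch below is one possible way to do it from Python, assuming the hadoop client is installed and on the PATH; it wraps the hadoop fs -test -e command, which exits with status 0 when the path exists. The function names are hypothetical, and the run parameter exists only so the command runner can be swapped out.

```python
import subprocess


def hdfs_test_command(path):
    """Build the argument list for `hadoop fs -test -e <path>`."""
    return ["hadoop", "fs", "-test", "-e", path]


def hdfs_path_exists(path, run=subprocess.call):
    """Return True if the existence test exits with status 0.

    `run` takes an argument list and returns an exit code; it defaults to
    subprocess.call, which raises FileNotFoundError if `hadoop` is missing.
    """
    return run(hdfs_test_command(path)) == 0
```

On a machine with the Hadoop client configured, hdfs_path_exists('/training/pig/datetemp.txt') would report whether the file from the question is present in HDFS before the Pig script is run.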

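Answer 5 (score: 0)

The log states the error clearly:

org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://192.168.195.130:8020/training/pig/datetemp.txt

Can you check whether the file exists in HDFS? Also check whether Pig is running in mapreduce mode or in local mode.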

Answer 6 (score: 0)

You can pass ',' to PigStorage to read a CSV file.

The query would look like this:

grunt> inputfile = load '/training/pig/datetemp.txt' using PigStorage(',') As (EventID:chararray,eventdate:chararray,count:int);

grunt> dump inputfile;

Also make sure the file '/training/pig/datetemp.txt' exists on HDFS. To test, run: hadoop fs -ls /training/pig/datetemp.txt