Exception when trying to execute a Pig Latin script

Time: 2018-07-22 14:45:56

Tags: hadoop mapreduce apache-pig

I am learning Pig on my own and ran into an exception while exploring a dataset. What is wrong with the script, and why?

The following script fails with an error at the end of MapReduce execution:

movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high   = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE movies_data.title,movies_data.year,movies_data.rating,movies_data.duration;
DUMP high_rated;

2 answers:

Answer 0: (score: 1)

First, let's see how to fix your problem. You don't need to go through the alias to access your fields. Your third line can simply be:

high_rated = FOREACH high GENERATE title, year, rating, duration;

If you want to use the alias for some reason, use the disambiguate operator (::), as the ERROR message suggests. Your line would then look like this:

high_rated = FOREACH high GENERATE movies_data::title, movies_data::year, movies_data::rating, movies_data::duration;

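Next, let's try to understand the exact reason behind the error message. When you access a field with the dot operator (.), Pig assumes the alias is a scalar, that is, a relation with exactly one row. Because your alias has multiple rows, Pig complains. You can read more about scalars in Pig here: https://issues.apache.org/jira/browse/PIG-1434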
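
For contrast, here is a minimal sketch of a case where the dot operator on an alias is legitimate, because the alias really does hold a single row. It reuses the path and schema from the question. The aliases all_movies, stats and with_diff and the field names avg_rating and diff are hypothetical names chosen for illustration.

```pig
-- Load the same data set and schema used in the question.
movies_data = LOAD '/movies_data' USING PigStorage(',') AS (id:chararray, title:chararray, year:int, rating:double, duration:double);

-- Collapse the whole relation into one group, so all_movies has exactly one row.
all_movies = GROUP movies_data ALL;

-- Inside this FOREACH, movies_data.rating projects a column of the grouped bag; it is not a scalar reference.
stats = FOREACH all_movies GENERATE AVG(movies_data.rating) AS avg_rating;

-- stats has a single row, so stats.avg_rating can be used as a scalar here.
with_diff = FOREACH movies_data GENERATE title, rating, rating - stats.avg_rating AS diff;
DUMP with_diff;
```

If stats contained more than one row, the scalar reference would fail at runtime in the same way as the script in the question.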

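In the release notes section of the JIRA issue (PIG-1434), you will notice at the end that the expected error message matches the one you encountered: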

If a relation contains more than single tuple, a runtime error is generated: 
"Scalar has more than one row in the output"

Answer 1: (score: 0)

This should work for you without any error.

movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high   = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE title,year,rating,duration;
DUMP high_rated;

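The FILTER command passes through every record that satisfies the filter condition, with all of its columns, so the fields of high can be referenced by name directly.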