I want to understand how to integrate a call to a MapReduce job inside a Pig script.
I referred to this link: https://wiki.apache.org/pig/NativeMapReduce
but I don't understand how Pig figures out which class is my mapper or reducer. The explanation there is not very clear.
It would be very helpful if someone could illustrate it with an example.
Thanks in advance, cheers :)
Answer 0 (score: 4)
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
In the example above, Pig stores the input data of A into inputDir and loads the job's output back from outputDir.
In addition, there is a jar named wordcount.jar in HDFS with a main class org.myorg.WordCount, which is responsible for setting up the mapper and reducer, the input and output, and so on.
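To make that concrete, here is a minimal sketch of what such a driver class might look like. It is only an illustration: it assumes the newer org.apache.hadoop.mapreduce API, and the TokenizerMapper/IntSumReducer names follow the classic Hadoop WordCount tutorial rather than anything from the question. The point is that the main class itself wires up the mapper and reducer via job.setMapperClass and job.setReducerClass, so Pig never needs to know about them.

package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a self-contained WordCount driver; class names are illustrative.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }

  // Pig passes the two arguments from the backquoted string: inputDir outputDir
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);   // this is how the job knows the mapper
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);    // ...and the reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // inputDir, written by Pig's STORE
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // outputDir, read back by Pig's LOAD
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once a jar like this is built and placed in HDFS as wordcount.jar (as the answer describes), the MAPREDUCE statement shown earlier is all the Pig script needs.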
You could also invoke the MapReduce job directly with hadoop jar mymr.jar org.myorg.WordCount inputDir outputDir.
Answer 1 (score: 0)
By default, Pig generates its own map/reduce programs. Hadoop also comes with default mapper/reducer implementations, which Pig uses when no map/reduce class is identified.
Beyond that, Pig uses Hadoop's properties as well as its own Pig-specific ones. Try setting the properties below in your Pig script; they should be picked up by Pig as well.
SET mapred.mapper.class '<fully qualified classname for mapper>';
SET mapred.reducer.class '<fully qualified classname for reducer>';
The same can also be set with the -Dmapred.mapper.class option. A comprehensive list is available here.
Depending on your Hadoop installation, the properties may also be:
mapreduce.map.class
mapreduce.reduce.class
FYI...
hadoop.mapred has been deprecated. Versions before 0.20.1 used mapred; later versions use mapreduce.
In addition, Pig has its own set of properties, which can be listed with the command pig -help properties.
For example, in my Pig installation, these are the properties:
The following properties are supported:
Logging:
verbose=true|false; default is false. This property is the same as -v switch
brief=true|false; default is false. This property is the same as -b switch
debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
aggregate.warning=true|false; default is true. If true, prints count of warnings
of each type rather than logging each warning.
Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner=true|false; default is false.
Only disable combiner as a temporary workaround for problems.
opt.multiquery=true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression=true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination=true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg=true|false. Default is false.
Determines if partial aggregation is done within map phase,
before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records
by this factor, it gets disabled.
Miscellaneous:
exectype=mapreduce|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs. Additionally, any Hadoop property can be specified.