I am writing my own Pig Store class; I don't want to store the output in a file, I plan to send it to a third-party data store (the API calls are omitted here).
Note: I am running this on Cloudera's VirtualBox image.
I wrote my Java class (listed below), built mystore.jar, and used it in my id.pig script:
store B INTO 'mylocation' USING MyStore('mynewlocation')
When I run this script with Pig, I see the following error:

ERROR 6000: Output location validation failed for: 'file://home/cloudera/test/id.out More info to follow: Output directory not set.

org.apache.pig.impl.plan.VisitorException: ERROR 6000:
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileValidator.visit(InputOutputFileValidator.java:95)

Please help!
-------------------- MyStore.java ----------------------
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class MyStore extends StoreFunc {
    protected RecordWriter writer = null;
    private String location = null;

    public MyStore() {
        location = null;
    }

    public MyStore(String location) {
        this.location = location;
    }

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        return new MyStoreOutputFormat(location);
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple tuple) throws IOException {
        // write tuple to location
        try {
            writer.write(null, tuple.toString());
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        if (location != null)
            this.location = location;
    }
}
-------------------- MyStoreOutputFormat.java ----------------------
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.data.Tuple;
public class MyStoreOutputFormat extends
        TextOutputFormat<WritableComparable, Tuple> {

    private String location = null;

    public MyStoreOutputFormat(String location) {
        this.location = location;
    }

    @Override
    public RecordWriter<WritableComparable, Tuple> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        String extension = location;
        Path file = getDefaultWorkFile(job, extension);
        FileSystem fs = file.getFileSystem(conf);
        FSDataOutputStream fileOut = fs.create(file, false);
        return new MyStoreRecordWriter(fileOut);
    }

    protected static class MyStoreRecordWriter extends
            RecordWriter<WritableComparable, Tuple> {

        DataOutputStream out = null;

        public MyStoreRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        @Override
        public void close(TaskAttemptContext taskContext) throws IOException,
                InterruptedException {
            // close the location
        }

        @Override
        public void write(WritableComparable key, Tuple value)
                throws IOException, InterruptedException {
            // write the data to location
            if (out != null) {
                out.writeChars(value.toString()); // will be calling API later. let me first dump to the location!
            }
        }
    }
}
Am I missing something here?
Answer 0 (score: 1)
Firstly, I think you should use the job Configuration to store the location value, rather than an instance variable.
The assignment to the local variable 'location' in the setStoreLocation method happens when the job is planned, but getOutputFormat may not be called until the execution phase, by which time the location variable may no longer be set (a new instance of your class may have been created).
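This lifecycle problem can be sketched without Hadoop at all: the front end (job planning) and the back end (task execution) each construct their own instance of the class, so a field set on the front-end copy never reaches the back-end copy, while the job configuration is serialized and shipped to every task. A minimal illustration (plain Java; `FakeStoreFunc` and the `Map` standing in for the real StoreFunc and Configuration are hypothetical stand-ins, not Pig API):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a StoreFunc: Pig uses one copy on the front end (planning)
// and fresh copies on the back end (task execution).
class FakeStoreFunc {
    private String location; // instance field: NOT shared between copies

    // Front end: Pig calls setStoreLocation here during planning.
    void setStoreLocation(String location, Map<String, String> conf) {
        this.location = location;         // lost: the back end gets a new instance
        conf.put("mylocation", location); // survives: the conf is shipped to every task
    }

    // Back end: a fresh instance can only read what is in the conf.
    String locationSeenByBackend(Map<String, String> conf) {
        return conf.get("mylocation");
    }

    String instanceField() {
        return location;
    }
}

public class Lifecycle {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // stands in for job.getConfiguration()

        FakeStoreFunc frontEnd = new FakeStoreFunc();
        frontEnd.setStoreLocation("mynewlocation", conf);

        FakeStoreFunc backEnd = new FakeStoreFunc(); // new instance: field is null again
        System.out.println(backEnd.instanceField());             // prints: null
        System.out.println(backEnd.locationSeenByBackend(conf)); // prints: mynewlocation
    }
}
```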
If you look at the source of PigStorage.setStoreLocation, you should notice that it stores the location in the job Configuration (2nd line):
@Override
public void setStoreLocation(String location, Job job) throws IOException {
    job.getConfiguration().set("mapred.textoutputformat.separator", "");
    FileOutputFormat.setOutputPath(job, new Path(location));
    if ("true".equals(job.getConfiguration().get("output.compression.enabled"))) {
        FileOutputFormat.setCompressOutput(job, true);
        String codec = job.getConfiguration().get("output.compression.codec");
        try {
            FileOutputFormat.setOutputCompressorClass(job,
                    (Class<? extends CompressionCodec>) Class.forName(codec));
        } catch (ClassNotFoundException e) {
            throw new RuntimeException("Class not found: " + codec);
        }
    } else {
        // This makes it so that storing to a directory ending with ".gz" or ".bz2" works.
        setCompression(new Path(location), job);
    }
}
So I think you should store the location in the job configuration:

@Override
public void setStoreLocation(String location, Job job) throws IOException {
    if (location != null)
        job.getConfiguration().set("mylocation", location);
}
It can then be extracted in the getRecordWriter method of your custom output format:
@Override
public RecordWriter<WritableComparable, Tuple> getRecordWriter(
        TaskAttemptContext job) throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    String extension = conf.get("mylocation");
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    FSDataOutputStream fileOut = fs.create(file, false);
    return new MyStoreRecordWriter(fileOut);
}
Finally (and probably the actual cause of the error you are seeing), your output format extends TextOutputFormat, and you use the getDefaultWorkFile method in the record writer. This method needs to know where in HDFS you are outputting files to, and you have not called FileOutputFormat.setOutputPath(job, new Path(location)); in your setStoreLocation method (see the PigStorage.setStoreLocation method I pasted previously). So the error occurs because it does not know where to create the default work file.
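Putting both changes together, a setStoreLocation along these lines should address the error (a sketch, not tested against a cluster; "mylocation" is the configuration key assumed above):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// ... inside MyStore:
@Override
public void setStoreLocation(String location, Job job) throws IOException {
    if (location != null) {
        // make the custom location visible to back-end instances of the output format
        job.getConfiguration().set("mylocation", location);
        // required because MyStoreOutputFormat extends TextOutputFormat:
        // getDefaultWorkFile needs the job's output path to be set
        FileOutputFormat.setOutputPath(job, new Path(location));
    }
}
```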