So I'm writing a small test program just to get the hang of Hadoop's InputFormat classes. I've built a word search that takes lines as values and searches them line by line. I wanted to see whether I could get Hadoop to feed the values in word by word instead, but Hadoop doesn't seem to like that and keeps running the job with the default mapper: my mapper's setup function is never even called.
I know my record reader is being invoked, and it is more or less doing what it should. I'm fairly sure the record reader's output is exactly what my mapper is supposed to search, so why has Hadoop decided not to call it?
Here is the relevant code.
The InputFormat class:
public class WordReader extends FileInputFormat&lt;Text, Text&gt; {

    @Override
    public RecordReader&lt;Text, Text&gt; createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new MyWholeFileReader();
    }
}
The record reader:
public class MyWholeFileReader extends RecordReader&lt;Text, Text&gt; {

    private long start;
    private LineReader in;
    private Text key = null;
    private Text value = null;
    private ArrayList&lt;String&gt; outputvalues;

    public void initialize(InputSplit genericSplit,
            TaskAttemptContext context) throws IOException {
        outputvalues = new ArrayList&lt;String&gt;();
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        start = split.getStart();
        final Path file = split.getPath();
        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        in = new LineReader(fileIn, job);
        if (key == null) {
            key = new Text();
        }
        key.set(split.getPath().getName());
        if (value == null) {
            value = new Text();
        }
    }

    public boolean nextKeyValue() throws IOException {
        if (outputvalues.size() == 0) {
            Text buffer = new Text();
            int i = in.readLine(buffer);
            String str = buffer.toString();
            for (String vals : str.split(" ")) {
                outputvalues.add(vals);
            }
            if (i == 0 || outputvalues.size() == 0) {
                key = null;
                value = null;
                return false;
            }
        }
        value.set(outputvalues.remove(0));
        System.out.println(value.toString());
        return true;
    }

    @Override
    public Text getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    /**
     * Get the progress within the split.
     */
    public float getProgress() {
        return 0.0f;
    }

    public synchronized void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}
The mapper:
public class WordSearchMapper extends Mapper&lt;Text, Text, OutputCollector&lt;Text, IntWritable&gt;, Reporter&gt; {

    static String keyword;
    BloomFilter&lt;String&gt; b;

    public void configure(JobContext jobConf) {
        keyword = jobConf.getConfiguration().get("keyword");
        System.out.println("keyword>> " + keyword);
        b = new BloomFilter&lt;String&gt;(.01, 10000);
        b.add(keyword);
        System.out.println(b.getExpectedBitsPerElement());
    }

    public void map(Text key, Text value, OutputCollector&lt;Text, IntWritable&gt; output,
            Reporter reporter) throws IOException {
        int wordPos;
        System.out.println("value.toString()>> " + value.toString());
        System.out.println(((FileSplit) reporter.getInputSplit()).getPath()
                .getName());
        String[] tokens = value.toString().split("[\\p{P} \\t\\n\\r]");
        for (String st : tokens) {
            if (b.contains(st)) {
                if (value.toString().contains(keyword)) {
                    System.out.println("Found one");
                    wordPos = ((Text) value).find(keyword);
                    output.collect(value, new IntWritable(wordPos));
                }
            }
        }
    }
}
The driver:
public class WordSearch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "WordSearch");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(WordSearchMapper.class);
        job.setInputFormatClass(WordReader.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        conf.set("keyword", "the");
        FileInputFormat.setInputPaths(job, new Path("search.txt"));
        FileOutputFormat.setOutputPath(job, new Path("outputs" + System.currentTimeMillis()));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Answer 0 (score: 0):
I figured it out... This is why Hadoop needs to stop supporting multiple API versions at once, or why I need to stop mashing several tutorials together. It turns out my mapper has to be declared like this for the mapper and the record reader setup to work together:
`public class WordSearchMapper extends Mapper { static String keyword;`
I only realized this after looking over my imports and seeing that Reporter comes from the org.apache.hadoop.mapred package rather than org.apache.hadoop.mapreduce.
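For what it's worth, here is a minimal sketch of how that mapper could look once it is written purely against the new org.apache.hadoop.mapreduce API. This is a reconstruction, not the exact code from the answer (whose class declaration appears partly cut off above): it assumes the input key/value types are Text/Text, as produced by MyWholeFileReader, and the output types are Text/IntWritable, matching the output.collect(value, new IntWritable(wordPos)) call in the question.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper written against the new (mapreduce) API only.
// The type parameters are an assumption: <input key, input value, output key, output value>.
public class WordSearchMapper extends Mapper&lt;Text, Text, Text, IntWritable&gt; {

    static String keyword;

    @Override
    protected void setup(Context context) {
        // setup() is the new-API replacement for the old configure() hook,
        // and it is what the framework actually calls before map().
        keyword = context.getConfiguration().get("keyword");
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // The new API passes a single Context instead of OutputCollector/Reporter.
        if (value.toString().contains(keyword)) {
            context.write(value, new IntWritable(value.find(keyword)));
        }
    }
}

With a declaration like that, the driver would also need job.setMapOutputKeyClass(Text.class) and job.setMapOutputValueClass(IntWritable.class) (or matching setOutputKeyClass/setOutputValueClass calls) so the declared generics and the job configuration agree.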