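I am writing an inverted index builder using Hadoop MapReduce. Some lines in my input files contain \n written out as actual characters (not ASCII 10, but the two literal characters '\' and 'n'). For some reason I don't understand, this seems to cause the map function to split such a line into two lines.
Here are some sample lines from a few of my files.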
32155:Wyldwood Radio: The move will begin on Friday, May 1, as originally planned!\n\nWe have run into some complications... http://t.co/g8STpuHn5Q
5:RT @immoumita: #SaveJalSatyagrahi\nJal Satyagraha 'holding on to the truth of water' https://t.co/x3XgRvCE5H via @4nks
15161:RT @immoumita: #SaveJalSatyagrahi\nJal Satyagraha 'holding on to the truth of water' https://t.co/x3XgRvCE5H via @4nks
Here is the output:
co:78516:tweets0001:30679;2,...,tweets0001:We have run into some complications... http;1,...
x3XgRvCE5H:2:tweets0000:Jal Satyagraha 'holding on to the truth of water' https;2
Here is my MapReduce code:
MAP
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // Everything before the first colon is treated as the tweet number.
        int colon_index = line.indexOf(":");
        if (colon_index > 0) {
            String tweet_num = line.substring(0, colon_index);
            line = line.substring(colon_index + 1);
            StringTokenizer tokenizer = new StringTokenizer(line, " !@$%^&*()-+=\"\\:;/?><.,{}[]|`~");
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String filename = fileSplit.getPath().getName();
            location.set(filename + ":" + tweet_num);
            // Emit one (word, filename:tweet_num) pair per token.
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, location);
            }
        }
    }
}
REDUCE
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        boolean first = true;
        int count = 0;
        StringBuilder locations = new StringBuilder();
        HashMap<String, Integer> frequencies = new HashMap<String, Integer>();
        while (values.hasNext()) {
            String location = values.next().toString();
            if (frequencies.containsKey(location)) {
                int frequency = frequencies.get(location).intValue() + 1;
                frequencies.put(location, new Integer(frequency));
            } else {
                frequencies.put(location, new Integer(1));
            }
            count++;
        }
        for (String location : frequencies.keySet()) {
            int frequency = frequencies.get(location).intValue();
            if (!first)
                locations.append(", ");
            locations.append(location);
            locations.append(";" + frequency);
            first = false;
        }
        StringBuilder finalString = new StringBuilder();
        finalString.append(":" + String.valueOf(count) + ": ");
        finalString.append(locations.toString());
        output.collect(key, new Text(finalString.toString()));
    }
}
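The general data flow is to map each line to {word, filename:line_number} pairs and then reduce those pairs by counting how often each one appears. The output should be:
Word -> :occurrences: filename1:line_number:occurrences_on_this_line, filename2 ....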
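The counting and formatting done in the reduce step can be tried outside Hadoop. The following is a minimal sketch, not the asker's reducer: the class name ReduceSketch, the method summarize, and the sample locations are made up for illustration, and a LinkedHashMap replaces the HashMap so the order of locations is predictable.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reduce step; no Hadoop classes are used.
class ReduceSketch {

    // Builds the value string for one word from all locations emitted for it.
    static String summarize(List<String> locations) {
        // Assumption: LinkedHashMap instead of the question's HashMap,
        // so locations appear in first-seen order.
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String location : locations) {
            frequencies.merge(location, 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder();
        out.append(':').append(locations.size()).append(": ");
        boolean first = true;
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            if (!first) {
                out.append(", ");
            }
            out.append(entry.getKey()).append(';').append(entry.getValue());
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: the same location twice, plus one other location.
        List<String> locations = Arrays.asList("tweets0000:5", "tweets0000:5", "tweets0001:9");
        System.out.println(summarize(locations)); // prints ":3: tweets0000:5;2, tweets0001:9;1"
    }
}
```

Note that the code separates a location from its per-location count with ";", which is what the sample output shows.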
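The MapReduce part works fine. You can even see from my example that the tweets on lines 5 and 15161 both contain the string x3XgRvCE5H, and, because my mapper looks for a colon to find the line number before appending it and both tweets contain the same text, they both map to the same index location, giving a "frequency" value of 2.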
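To make the effect of the colon lookup concrete, here is a small sketch that repeats the mapper's string handling in plain Java. MapSketch and mapLine are hypothetical names, the sample lines are shortened, and nothing here touches the Hadoop API. It only shows that a fragment left over after a split takes its "line number" from whatever text precedes its first colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical plain-Java stand-in for the body of map(); no Hadoop classes.
class MapSketch {
    // Same delimiter set as the mapper in the question.
    static final String DELIMS = " !@$%^&*()-+=\"\\:;/?><.,{}[]|`~";

    // Returns "word -> filename:prefix" strings for one input line.
    static List<String> mapLine(String filename, String line) {
        List<String> pairs = new ArrayList<>();
        int colonIndex = line.indexOf(':');
        if (colonIndex > 0) {
            // Whatever precedes the first colon becomes the "line number".
            String location = filename + ":" + line.substring(0, colonIndex);
            StringTokenizer tokenizer =
                    new StringTokenizer(line.substring(colonIndex + 1), DELIMS);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(tokenizer.nextToken() + " -> " + location);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A complete record: the prefix is the tweet number.
        System.out.println(mapLine("tweets0000", "5:RT https://t.co/x3XgRvCE5H"));
        // A fragment after an unwanted split: the prefix is tweet text up to "https".
        System.out.println(mapLine("tweets0000", "Jal Satyagraha https://t.co/x3XgRvCE5H"));
    }
}
```

For the fragment, the last pair is x3XgRvCE5H -> tweets0000:Jal Satyagraha https, which has the same shape as the second line of the sample output.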
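So, my question is: how do I get Hadoop's input format to not read the characters "\n" as a newline? After all, they are not ASCII 10, an actual newline or line feed, but two separate characters.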
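Before changing the input format, it may be worth confirming the premise on a small sample of the input. As far as I know, Hadoop's stock line reader ends a record at a real line feed or carriage return byte, so counting those bytes separately from literal backslash-n pairs can show which of the two the file really contains. The sketch below assumes a file small enough to read into memory; NewlineCheck and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic: distinguishes real line breaks from literal "\n" text.
class NewlineCheck {

    // Counts occurrences of a single byte value, e.g. 10 (LF) or 13 (CR).
    static int countByte(byte[] data, int value) {
        int count = 0;
        for (byte b : data) {
            if (b == value) {
                count++;
            }
        }
        return count;
    }

    // Counts the two-byte sequence backslash (92) followed by 'n' (110).
    static int countLiteralBackslashN(byte[] data) {
        int count = 0;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == '\\' && data[i + 1] == 'n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // With no argument, fall back to a made-up sample that contains
        // one literal backslash-n and one real line feed.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "5:a\\nb\n7:c".getBytes(StandardCharsets.UTF_8);
        System.out.println("line feeds (10):       " + countByte(data, 10));
        System.out.println("carriage returns (13): " + countByte(data, 13));
        System.out.println("literal backslash-n:   " + countLiteralBackslashN(data));
    }
}
```

If the line-feed count is higher than the number of tweets, the extra breaks are real bytes in the file and not the two-character sequence.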
Answer 0 (score: 1)
You have to extend FileInputFormat and write a new class that overrides this behavior. For example:
public class ClientTrafficInputFormat
        extends FileInputFormat<ClientTrafficKeyWritable, ClientTrafficValueWritable> {
    @Override
    public RecordReader<ClientTrafficKeyWritable, ClientTrafficValueWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new ClientTrafficRecordReader();
    }
}
You should also override the RecordReader:
public class ClientTrafficRecordReader extends
        RecordReader<ClientTrafficKeyWritable, ClientTrafficValueWritable> {
    ...
    // Create your own RecordReader here. This is where you specify that "\n"
    // is not a line break but should be read as the two characters '\' and 'n'.
    private LineRecordReader reader = new LineRecordReader();

    @Override
    public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException,
            InterruptedException {
        reader.initialize(is, tac);
    }
    ...
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // customize your input
    }
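As a rough illustration of what the record-reading logic inside nextKeyValue() could look like, here is a plain-Java sketch that ends a record only at a real line feed byte and passes a literal backslash followed by 'n' through untouched. It is a stand-in and not a Hadoop class: LineFeedOnlyReader is a hypothetical name, it reads from an ordinary InputStream, it does not treat carriage returns as breaks, and it ignores input split boundaries, which a real RecordReader has to handle. Note also that the answer's classes use the newer org.apache.hadoop.mapreduce API, while the question's mapper and reducer use the older mapred interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in that mimics the nextKeyValue/getCurrent* contract.
class LineFeedOnlyReader {
    private final InputStream in;
    private long position = 0;    // byte offset of the next unread byte
    private long currentKey = -1; // byte offset where the current record starts
    private String currentValue;  // text of the current record, without the LF

    LineFeedOnlyReader(InputStream in) {
        this.in = in;
    }

    // Advances to the next record; returns false once the input is exhausted.
    boolean nextKeyValue() throws IOException {
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        long start = position;
        boolean sawAnyByte = false;
        int b;
        while ((b = in.read()) != -1) {
            position++;
            sawAnyByte = true;
            if (b == '\n') {
                break; // only a real line feed (byte 10) ends a record
            }
            record.write(b); // a backslash or an 'n' is ordinary data
        }
        if (!sawAnyByte) {
            return false;
        }
        currentKey = start;
        currentValue = new String(record.toByteArray(), StandardCharsets.UTF_8);
        return true;
    }

    long getCurrentKey() {
        return currentKey;
    }

    String getCurrentValue() {
        return currentValue;
    }

    public static void main(String[] args) throws IOException {
        // Made-up input: a literal backslash-n inside record one, then a real LF.
        byte[] data = "5:a\\nb\n7:c".getBytes(StandardCharsets.UTF_8);
        LineFeedOnlyReader reader = new LineFeedOnlyReader(new ByteArrayInputStream(data));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " => " + reader.getCurrentValue());
        }
    }
}
```

With the made-up input, the first record keeps its backslash and 'n' as two characters, and the second record starts at byte offset 7.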