How can I make Hadoop ignore \n characters in my input files?

Asked: 2015-05-28 05:48:37

Tags: java string hadoop io newline

I am writing an inverted-index creator using Hadoop's MapReduce. Some lines in my input files contain the characters \n written out as actual characters (not ASCII 10, but the two literal characters '\' and 'n'). For some reason I don't understand, this seems to cause the map function to split my line into two lines.
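To make the distinction concrete, here is a minimal plain-Java sketch (not part of the original post) showing that the two-character sequence '\' + 'n' is not the same thing as an actual line feed:

```java
public class NewlineCheck {
    public static void main(String[] args) {
        // A line as it appears in the file: backslash + 'n', two separate characters
        String literal = "complications...\\nmore text";
        // An actual line feed, ASCII 10
        String real = "complications...\nmore text";

        System.out.println(literal.contains("\n"));      // false: no ASCII 10 present
        System.out.println(literal.contains("\\n"));     // true: the two-character sequence
        System.out.println(real.split("\n").length);     // 2: a real newline splits the string
        System.out.println(literal.split("\n").length);  // 1: the literal "\n" does not
    }
}
```

If this check reports that your lines contain ASCII 10, the file has genuine line feeds rather than the two-character sequence, which would explain records being split.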

Here are some sample lines from my files:

32155:Wyldwood Radio: The move will be starting Friday May 1st, as originally planned! \n\nWe've run into a few complications... http://t.co/g8STpuHn5Q

5:RT @immoumita: #SaveJalSatyagrahi \nJal Satyagraha 'holding on to the truth of water' https://t.co/x3XgRvCE5H from @4nks

15161:RT @immoumita: #SaveJalSatyagrahi \nJal Satyagraha 'holding on to the truth of water' https://t.co/x3XgRvCE5H from @4nks

And here is the output:

co:78516:tweets0001:30679;2,...,tweets0001:We've run into a few complications... http;1,...

x3XgRvCE5H:2:tweets0000:Jal Satyagraha 'holding on to the truth of water' https;2

Here is my MapReduce code:

MAP

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        String line = value.toString();

        int colon_index = line.indexOf(":");
        if(colon_index > 0)
        {
            String tweet_num = line.substring(0,colon_index);
            line = line.substring(colon_index + 1);

            StringTokenizer tokenizer = new StringTokenizer(line," !@$%^&*()-+=\"\\:;/?><.,{}[]|`~");
            FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
            String filename = fileSplit.getPath().getName();
            location.set(filename + ":" + tweet_num);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, location);
            }
        }
    }
}
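The mapper's parsing can be exercised outside Hadoop. Here is a minimal sketch (plain Java, no Hadoop types; the class and method names are mine, not from the original code) of the colon-prefix stripping and tokenizing logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapLogicDemo {
    // Mirrors the mapper: strip the "tweetnum:" prefix, then tokenize the rest
    // on the same delimiter set used in the map function above
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        int colon = line.indexOf(':');
        if (colon > 0) {
            String rest = line.substring(colon + 1);
            StringTokenizer t = new StringTokenizer(rest, " !@$%^&*()-+=\"\\:;/?><.,{}[]|`~");
            while (t.hasMoreTokens()) result.add(t.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [RT, immoumita, #SaveJalSatyagrahi] -- '@' and ':' are delimiters, '#' is not
        System.out.println(tokens("5:RT @immoumita:#SaveJalSatyagrahi"));
    }
}
```

Note that '\' is itself in the delimiter set, so even when a literal "\n" survives inside a line, the backslash is dropped and the 'n' fuses onto the next token.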

REDUCE

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        boolean first = true;
        int count = 0;
        StringBuilder locations = new StringBuilder();
        HashMap<String,Integer> frequencies = new HashMap<String, Integer>();

        while (values.hasNext()) {
            String location = values.next().toString();
            if(frequencies.containsKey(location)){
                int frequency = frequencies.get(location).intValue() + 1;
                frequencies.put(location,new Integer(frequency));
            }
            else{
                frequencies.put(location,new Integer(1));
            }
            count++;
        }
        for(String location : frequencies.keySet()){
            int frequency = frequencies.get(location).intValue();
            if(!first)
                locations.append(", ");
            locations.append(location);
            locations.append(";"+frequency);
            first = false;
        }
        StringBuilder finalString = new StringBuilder();
        finalString.append(":"+String.valueOf(count)+": ");
        finalString.append(locations.toString());
        output.collect(key, new Text(finalString.toString()));
    }
}
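The reducer's counting logic can likewise be tested without Hadoop. A minimal sketch (plain Java; the class and method names are mine) of the frequency accumulation and output formatting:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class ReduceLogicDemo {
    // Mirrors the reducer: count total occurrences and per-location frequencies,
    // then format as ":count: location;freq, location;freq, ..."
    static String summarize(List<String> locations) {
        HashMap<String, Integer> freq = new HashMap<>();
        int count = 0;
        for (String loc : locations) {
            freq.merge(loc, 1, Integer::sum);
            count++;
        }
        StringBuilder sb = new StringBuilder(":" + count + ": ");
        boolean first = true;
        for (HashMap.Entry<String, Integer> e : freq.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append(";").append(e.getValue());
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(summarize(Arrays.asList(
            "tweets0000:5", "tweets0000:15161", "tweets0000:5")));
    }
}
```

Since HashMap iteration order is unspecified, the order of locations in the output (here and in the real reducer) is not deterministic across runs or JVMs.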

The general data flow is to map each line to a {word, filename:line_number} pair, and then reduce those pairs by counting how often each appears. The output should be:

Word -> :occurrences: filename1:line_number:occurrences_on_this_line, filename2....

The MapReduce part works fine. You can even see from my example that the tweets on lines 5 and 15161 both contain the string x3XgRvCE5H, and, because my mapper looks for the colon-delimited number before appending a line and these two tweets contain the same text, they both map to the same index location, giving a "frequency" value of 2.

So, my question is: how can I make Hadoop's input format not read the characters "\n" as a newline? After all, they are not ASCII 10, an actual newline / line-feed character, but two separate characters.
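As an aside, if the goal is simply to keep the two-character sequence from polluting the tokens, one workaround that needs no custom input format is to replace it in the mapper before tokenizing. A minimal sketch (the class and method names are mine, not from the original code):

```java
public class StripLiteralNewline {
    // Replaces the two-character sequence '\' + 'n' (not ASCII 10) with a space,
    // so the tokenizer sees a normal word boundary; real line feeds are untouched
    static String clean(String line) {
        return line.replace("\\n", " ");
    }

    public static void main(String[] args) {
        // prints "start! Jal Satyagraha" -- the backslash-n pair is gone
        System.out.println(clean("start!\\nJal Satyagraha"));
    }
}
```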

1 Answer:

Answer 0: (score: 1)

You have to extend FileInputFormat and write a new class that overrides this behavior. For example:

public class ClientTrafficInputFormat extends FileInputFormat<ClientTrafficKeyWritable, ClientTrafficValueWritable> {

    @Override
    public RecordReader<ClientTrafficKeyWritable, ClientTrafficValueWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {

        return new ClientTrafficRecordReader();
    }

}

The RecordReader should also be overridden:

public class ClientTrafficRecordReader extends
        RecordReader<ClientTrafficKeyWritable, ClientTrafficValueWritable> {

    ...

    // Wrap a LineRecordReader (or write your own); this is where you decide
    // that '\' followed by 'n' should be read as two ordinary characters
    // rather than as a line break
    private LineRecordReader reader = new LineRecordReader();

    @Override
    public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException,
            InterruptedException {

        reader.initialize(is, tac);

    }

    ...

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // customize how your input is read here
    }
}
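Independent of Hadoop, the record-boundary rule such a reader needs can be sketched in plain Java: split records only on actual line feeds (ASCII 10), so the two-character sequence '\' + 'n' stays inside one record. (This is, for what it's worth, also what the stock LineRecordReader does, since it splits only on real line terminators; the class and method names below are mine.)

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LiteralNewlineReaderDemo {
    // Splits input into records on real line feeds (ASCII 10) only;
    // the two-character sequence '\' + 'n' stays inside a record.
    static List<String> records(Reader in) throws IOException {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') { out.add(cur.toString()); cur.setLength(0); }
            else cur.append((char) c);
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) throws IOException {
        // One physical line containing a literal "\n": stays a single record
        System.out.println(records(new StringReader("start!\\nWe hit complications")).size()); // 1
    }
}
```

If records are nonetheless breaking at "\n", that suggests the files contain genuine line feeds at those positions, which the sketch after the question text can confirm.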