Question

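My algorithm reads a number of text files from a large dataset and searches their lines for specific terms. I have implemented it in Java, but I would rather not post the code, so that it doesn't look like I am asking someone to implement it for me, although I really do need a lot of help! This was not part of my project plan, but the dataset is huge, so my teacher told me I have to do it this way.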

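EDIT (I did not make this clear in my previous version): the dataset I have is on a Hadoop cluster, and I am supposed to produce a MapReduce implementation.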

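I was reading about MapReduce and thought it would be easier to write a standard implementation first and then redo it with MapReduce. That didn't work out: the algorithm is very simple and nothing special, but MapReduce... I just can't wrap my head around it.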

So here is the pseudocode of my algorithm:

LIST termList   (there is method that creates this list from lucene index)
FOLDER topFolder

INPUT topFolder
IF it is folder and not empty
    list files (there are 30 sub folders inside)
    FOR EACH sub folder
        GET file "CheckedFile.txt"
        analyze(CheckedFile)
    ENDFOR
END IF


Method ANALYZE(CheckedFile)

read CheckedFile
WHILE CheckedFile has next line
    GET line
    FOR(loops through termList)
        GET third word from line
        IF third word = term from list
            append whole line to string buffer
        ENDIF
    ENDFOR
END WHILE
OUTPUT string buffer to file

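Also, as you can see, a new file has to be created each time "analyze" is called, and as I understand it, writing to multiple outputs is difficult in MapReduce???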

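I understand the intuition behind MapReduce, and my example seems perfectly suited to it, but when it comes to actually doing it I obviously don't know enough, and I'm getting frustrated!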

Please help.

Answer 1

You can use an empty reducer and partition your job so that a single mapper runs per file. Each mapper will then create its own output file in your output folder.
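As a rough illustration of the per-line work each mapper would do, here is a minimal plain-Java sketch of the filter from the question's pseudocode. It deliberately uses no Hadoop classes: `TermLineFilter` and `keep` are hypothetical names, and splitting on whitespace is an assumption about the line format. In an actual job this check would sit inside the map method, and the reduce phase could be switched off with `job.setNumReduceTasks(0)`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: decides whether a line should be written to the output.
final class TermLineFilter {

  // A Set gives constant-time lookups, unlike List.contains
  private final Set<String> terms;

  TermLineFilter(Set<String> terms) {
    this.terms = terms;
  }

  // True when the line has at least three words and the third one is a known term.
  boolean keep(String line) {
    String[] tokens = line.trim().split("\\s+"); // assumption: whitespace-separated words
    return tokens.length >= 3 && terms.contains(tokens[2]);
  }

  public static void main(String[] args) {
    TermLineFilter filter =
        new TermLineFilter(new HashSet<>(Arrays.asList("terma", "termb")));
    System.out.println(filter.keep("id 42 terma rest of line")); // true
    System.out.println(filter.keep("id 42 other rest of line")); // false
  }
}
```

Because the filter holds no state beyond the term set, the same instance could be reused for every line a mapper sees.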

Answer 2

MapReduce is very easy to implement using some of the nice Java 6 concurrency features, especially Future, Callable and ExecutorService.

I created a Callable that will analyse a file in the way you specified:

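import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;

public class FileAnalyser implements Callable<String> {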

  private Scanner scanner;
  private List<String> termList;

  public FileAnalyser(String filename, List<String> termList) throws FileNotFoundException {
    this.termList = termList;
    scanner = new Scanner(new File(filename));
  }

  @Override
  public String call() throws Exception {
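    StringBuilder buffer = new StringBuilder();
    try {
      while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        String[] tokens = line.split(" ");
        //Keep the whole line when its third word is one of the terms
        if ((tokens.length >= 3) && (inTermList(tokens[2])))
          buffer.append(line).append('\n'); //newline keeps matching lines separate
      }
    } finally {
      scanner.close(); //release the file handle once the file has been read
    }
    return buffer.toString();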
  }

  private boolean inTermList(String term) {
    return termList.contains(term);
  }
}

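We need to create a new callable for each file found and submit it to the executor service. The result of the submission is a Future, which we can use later to obtain the result of the file parse.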

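import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Analyser {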

  private static final int THREAD_COUNT = 10;

  //Future.get() throws checked exceptions, so main declares them
  public static void main(String[] args) throws Exception {

    //All callables will be submitted to this executor service
    //Play around with THREAD_COUNT for optimum performance
    ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

    //Store all futures in this list so we can refer to them easily
    List<Future<String>> futureList = new ArrayList<Future<String>>();

    //Some random term list, I don't know what you're using.
    List<String> termList = new ArrayList<String>();
    termList.add("terma");
    termList.add("termb");

    //For each file you find, create a new FileAnalyser callable and submit
    //this to the executor service. Add the future to the list
    //so we can check back on the result later
    //(assumption: the file paths are passed as command-line arguments)
    for (String filename : args) {
      try {
        Callable<String> worker = new FileAnalyser(filename, termList);
        Future<String> future = executor.submit(worker);
        futureList.add(future);
      }
      catch (FileNotFoundException fnfe) {
        //If the file doesn't exist at this point we can probably ignore,
        //but I'll leave that for you to decide.
        System.err.println("Unable to create future for " + filename);
        fnfe.printStackTrace(System.err);
      }
    }

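    //No new tasks will be submitted, so let the pool shut down once the
    //queued work is done (otherwise the JVM will not exit)
    executor.shutdown();

    //You may want to wait at this point, until all threads have finished.
    //You could loop through each future until isDone() holds true
    //for each of them, although get() below also blocks until the result is ready.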

    //Loop over all finished futures and do something with the result
    //from each
    for (Future<String> current : futureList) {
      String result = current.get();
      //Do something with the result from this future
    }
  }
}
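
If you simply want to block until every file has been processed, `ExecutorService.invokeAll` is one option: it submits a whole collection of callables and returns only once all of them have completed. The following is a minimal sketch under assumptions: `InvokeAllSketch`, `runAll` and the stand-in tasks are hypothetical, the lambdas need Java 8 or later, and in the setting above each task would be a `FileAnalyser`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class InvokeAllSketch {

  // Runs all tasks on a fixed pool and returns their results in submission order.
  static List<String> runAll(List<Callable<String>> tasks, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      // invokeAll blocks until every task has completed, normally or with an exception
      List<Future<String>> futures = executor.invokeAll(tasks);
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        // The task is already done here; get() rethrows any failure from the task
        results.add(future.get());
      }
      return results;
    } finally {
      executor.shutdown(); // let the JVM exit once the work is done
    }
  }

  public static void main(String[] args) throws Exception {
    List<Callable<String>> tasks = new ArrayList<>();
    // Stand-ins for new FileAnalyser(filename, termList)
    tasks.add(() -> "lines matched in a.txt");
    tasks.add(() -> "lines matched in b.txt");
    for (String result : runAll(tasks, 2)) {
      System.out.println(result);
    }
  }
}
```

The returned list follows the order of the submitted tasks, so results can be matched back to their files by index.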

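My example is far from complete and far from efficient. I haven't considered the sample size; if it is really large, you could keep looping over the futureList and removing the elements that have finished, something like: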

while (futureList.size() > 0) {
  for (Future<String> current : futureList) {
    if (current.isDone()) {
      String result = current.get();
      //Do something with result
      futureList.remove(current);
      break; //We have modified the list during iteration, best break out of for-loop
    }
  }
}

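Alternatively, you could implement a producer-consumer type setup, where the producer submits callables to the executor service and produces futures, and the consumer takes the result of each future and then discards the future.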

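This would probably require the producer and the consumer to be threads themselves, plus a synchronized list for adding and removing futures.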
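
One ready-made form of this producer-consumer idea in `java.util.concurrent` is `ExecutorCompletionService`, which queues futures as they finish so the consumer can call `take()` without polling or a hand-written synchronized list. The following is a minimal sketch rather than the setup described above: `CompletionSketch`, `collect` and the stand-in tasks are hypothetical names, and the lambdas need Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CompletionSketch {

  // Submits every task, then consumes the results in completion order.
  static List<String> collect(List<Callable<String>> tasks, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    CompletionService<String> completion = new ExecutorCompletionService<>(executor);
    try {
      for (Callable<String> task : tasks) {
        completion.submit(task); // producer side
      }
      List<String> results = new ArrayList<>();
      for (int i = 0; i < tasks.size(); i++) {
        // take() blocks until some task has finished, whichever comes first
        results.add(completion.take().get()); // consumer side
      }
      return results;
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<Callable<String>> tasks = new ArrayList<>();
    // Stand-ins for new FileAnalyser(filename, termList)
    tasks.add(() -> "result for a.txt");
    tasks.add(() -> "result for b.txt");
    for (String result : collect(tasks, 2)) {
      System.out.println(result);
    }
  }
}
```

Results arrive in the order the tasks finish, not the order they were submitted, so each result would need to carry its file name if that matters.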

If you have any questions, just ask.

Need help implementing this algorithm with Hadoop MapReduce
