Question

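I was recently asked the following question in an interview. It sounded simple, but it turned out to be tricky for me.

There are many files spread across a folder and all of its subfolders. Each file contains a lot of numbers, one on each line. Given a root folder, I need to find the 100 largest numbers across all of those files. I came up with the following solution: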


Read all the files line by line.

Store each number in an array list.

Sort the list in descending order.

Take the first k numbers from the list.

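The interviewer then asked me what the time complexity of this would be. I said that since we are sorting, it is O(n log n). He then asked how the program could be improved: since it stores everything in memory and then sorts, what happens if everything does not fit in memory?

At that point I got confused and could not work out whether there is a better or more efficient way to solve the problem. He wanted me to write efficient code. Is there a better way to do this?

Below is the original code I came up with: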

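  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  private static final List<Integer> numbers = new ArrayList<>();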

  public static void main(String[] args) {
    int k = 100;
    List<Integer> numbers = findKLargest("/home/david");

    // sort in descending order
    Collections.sort(numbers, Collections.reverseOrder());
    List<Integer> kLargest = new ArrayList<>();
    int j = 0;
    // now iterate all the numbers and get the first k numbers from the list
    for (Integer num : numbers) {
      j++;
      kLargest.add(num);
      if (j == k) {
        break;
      }
    }
    // print the first k numbers
    System.out.println(kLargest);
  }

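  /**
   * Reads all the numbers from all the files under the given directory
   * (recursively) and loads them into the shared array list.
   * @param rootDirectory path of the directory to scan
   * @return the shared list holding every number read so far
   */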
  private static List<Integer> findKLargest(String rootDirectory) {
    if (rootDirectory == null || rootDirectory.isEmpty()) {
      return new ArrayList<>();
    }

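    File[] entries = new File(rootDirectory).listFiles();
    if (entries == null) {
      // not a directory, or it could not be read
      return numbers;
    }
    for (File entry : entries) {
      if (entry.isDirectory()) {
        // recurse with the full path; the shared list is filled in place
        findKLargest(entry.getPath());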
      } else {
        try (BufferedReader br = new BufferedReader(new FileReader(entry))) {
          String line;
          while ((line = br.readLine()) != null) {
            numbers.add(Integer.parseInt(line));
          }
        } catch (NumberFormatException | IOException e) {
          e.printStackTrace();
        }
      }
    }
    return numbers;
  }

Answer 1

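Instead of storing all N values (where N is the total count of numbers across all files) and sorting them, you can store just 100 values: the largest ones seen so far at any given moment.

A convenient data structure for this task is a priority queue (usually based on a binary heap). Create a min-heap from the first 100 values, then for each new value check whether it is larger than the top of the heap. If it is, remove the top and insert the new item.

Space complexity is O(K) and time complexity is O(N log K). Here K = 100, so the complexities can be treated as O(1) and O(N) (dropping the constant factor).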
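To make the "constant factor" concrete, here is a rough worked estimate. It assumes a binary heap, as the answer suggests, and counts the levels of a complete binary tree with K nodes:

```latex
\text{levels}(K) = \lfloor \log_2 K \rfloor + 1,
\qquad
\text{levels}(100) = \lfloor 6.64 \rfloor + 1 = 7
```

So each replacement moves an element across at most 7 levels no matter how large N grows, which is why the total work scales linearly with N.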

A Python example showing how it works:

import heapq, random

pq = [random.randint(0, 30) for _ in range(5)]  # initial values
print(pq)
heapq.heapify(pq)                               # initial values ordered in heap
print(pq)
for i in range(5):
    r = random.randint(0, 30)    # add 5 more values
    if r > pq[0]:
        heapq.heappop(pq)
        heapq.heappush(pq, r)
    print(r, pq)

[17, 22, 10, 1, 15]   //initial values
[1, 15, 10, 22, 17]   //heapified, smallest is on the left
29 [10, 15, 17, 22, 29]     //29 replaces 1
25 [15, 22, 17, 29, 25]     //25 replaces 10
14 [15, 22, 17, 29, 25]      //14 is too small
8 [15, 22, 17, 29, 25]       //8 is too small
21 [17, 21, 25, 29, 22]     //21 is in the club now

Answer 2

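Adding to @MBo's answer, a Java implementation looks like this.

Using PriorityQueue

Create a min-heap using a PriorityQueue with an initial capacity of 100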

int MAX = 100;
PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);

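Read the numbers from the files, insert them, and keep the min-heap balanced. Compare the min-heap's minValue with newValue. If newValue is larger, remove minValue and insert newValue.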

public void balanceMinHeap(int newValue) {

    if(queue.size() < MAX) {
        queue.add(newValue);
        return;
    }

    if(queue.peek() < newValue) {
        queue.remove();
        queue.add(newValue);
    }

}
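The step "read the numbers from the files" is not shown above, so here is a minimal sketch of how it might be wired to the bounded heap. The class `TopKFromFiles` and its methods `offer` and `largest` are hypothetical names introduced for illustration. The sketch swaps the question's `File.listFiles` recursion for `java.nio.file.Files.walk`, assumes one integer per line, skips lines that do not parse, and requires `k >= 1`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.PriorityQueue;
import java.util.stream.Stream;

final class TopKFromFiles {

    /** Offers one value to a min-heap that never holds more than k items. */
    static void offer(PriorityQueue<Integer> heap, int k, int value) {
        if (heap.size() < k) {
            heap.add(value);
        } else if (heap.peek() < value) {
            heap.poll();      // drop the current smallest of the kept values
            heap.add(value);
        }
    }

    /** Streams every regular file under root and keeps only the k largest numbers (k >= 1). */
    static PriorityQueue<Integer> largest(Path root, int k) throws IOException {
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        try (Stream<Path> paths = Files.walk(root)) {
            Iterator<Path> files = paths.filter(Files::isRegularFile).iterator();
            while (files.hasNext()) {
                // one file is open at a time and only one line is held in memory
                try (BufferedReader reader = Files.newBufferedReader(files.next())) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        line = line.trim();
                        if (line.isEmpty()) {
                            continue;
                        }
                        try {
                            offer(heap, k, Integer.parseInt(line));
                        } catch (NumberFormatException e) {
                            // assumption: lines that are not a single integer are skipped
                        }
                    }
                }
            }
        }
        return heap;
    }
}
```

A call such as `TopKFromFiles.largest(Paths.get("/home/david"), 100)` would return a heap of at most 100 values, so memory use stays bounded by K regardless of how many numbers the files contain.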

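Now you can retrieve the 100 largest numbers from the min-heap in ascending order

    // drain until empty, so this also works when fewer than 100 numbers were read
    while (!queue.isEmpty()) {
        System.out.println(queue.remove());
    }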

If you want the same 100 largest numbers in descending order, just move the contents of that queue into a max-heap (that is, another PriorityQueue)

Comparator<Integer> desendingOrder = new Comparator<Integer>() {
    public int compare(Integer x, Integer y) {
         return Integer.compare(y, x); // y - x can overflow for large magnitudes
     }
};

PriorityQueue<Integer> maxHeap = new PriorityQueue<>(MAX, desendingOrder);

Or simply use the built-in Collections.reverseOrder

PriorityQueue<Integer> maxHeap = new PriorityQueue<>(MAX, Collections.reverseOrder());
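The snippets above create the max-heap but do not show the transfer itself. A minimal sketch of that last step might look like the following; `DescendingDrain` and `drainDescending` are hypothetical names, and the method empties the source min-heap as a side effect.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class DescendingDrain {

    /** Moves every element of minHeap into a max-heap and polls them out largest first. */
    static List<Integer> drainDescending(PriorityQueue<Integer> minHeap) {
        // the initial capacity must be at least 1, even for an empty source heap
        PriorityQueue<Integer> maxHeap =
                new PriorityQueue<>(Math.max(1, minHeap.size()), Collections.reverseOrder());
        maxHeap.addAll(minHeap);
        minHeap.clear();

        List<Integer> result = new ArrayList<>(maxHeap.size());
        while (!maxHeap.isEmpty()) {
            result.add(maxHeap.poll()); // poll() returns the largest remaining value
        }
        return result;
    }
}
```

Since only K values are involved at this point, copying the heap into a list and sorting it with `Collections.reverseOrder()` would work just as well; the max-heap is used here only to stay close to the answer's suggestion.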
