Question

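As the title states, I am trying to find the N most common K-length substrings in a large collection of strings, along with their frequencies. The strings are read line by line from a file (roughly 5 million lines). For example, if the input file is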

TTTTTGCAG

GCAGGTTTT

and K = 4, N = 2, then the output should be

TTTT - 3 occurrences

GCAG - 2 occurrences

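The sample file consists of DNA sequences, but I would like to arrive at a general solution.

What I have done so far:

Read all lines into a std::vector<std::string>.
Initialize a hash map, std::unordered_map<std::string_view, unsigned int>.
For every line, take all line.length()-K+1 substrings of length K.
For every substring, increment its frequency if it is already in the map; otherwise insert it.
Transfer all entries of the map into a std::multimap<unsigned int, std::string_view>, then take the last N values and print them.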
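
For reference, the steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the function name top_kmers and its signature are assumptions made for the example, and it takes the lines as an already-filled vector rather than reading a file.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: counts every k-length substring of every line and
// returns up to n of the most frequent ones, most frequent first.
// The string_view keys point into `lines`, so `lines` must outlive the result.
std::vector<std::pair<std::string_view, unsigned>>
top_kmers(const std::vector<std::string>& lines, std::size_t k, std::size_t n) {
  std::vector<std::pair<std::string_view, unsigned>> top;
  if (k == 0) return top;

  std::unordered_map<std::string_view, unsigned> counts;
  for (const std::string& line : lines) {
    std::string_view view(line);
    // i + k <= size() skips lines shorter than k without unsigned underflow.
    for (std::size_t i = 0; i + k <= view.size(); ++i) {
      ++counts[view.substr(i, k)];  // new keys start at 0, then become 1
    }
  }

  // Re-key by count; a multimap keeps its entries ordered by ascending count.
  std::multimap<unsigned, std::string_view> by_count;
  for (const auto& [word, count] : counts) by_count.emplace(count, word);

  // Walk from the largest count downwards and keep at most n entries.
  for (auto it = by_count.rbegin(); it != by_count.rend() && top.size() < n; ++it) {
    top.emplace_back(it->second, it->first);
  }
  return top;
}
```

With the two example lines and k = 4, n = 2, this returns TTTT with a count of 3 followed by GCAG with a count of 2. The order among substrings that share the same count is not specified.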

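I used string_view instead of string to obtain the substrings more efficiently and to avoid spending memory on a separate copy for every key.

This approach works, but I am looking for a more optimal solution. I suspect the problem is that, as the input grows, the average time for insertion and lookup in the hash map becomes O(N) instead of O(1). Is that true, and what can I do to improve the runtime and memory usage?

(I also tried tries, but they are not memory-efficient even with an alphabet of size 4 (A, C, G, T), and traversing one to find the N most frequent substrings is a further problem.)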

Answer 1

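One possible approach:

Instead of an unordered_map, use a std::vector<std::pair<std::string, int>> that is kept sorted by string. While scanning the substrings of each line you read, look each one up with a binary search (std::lower_bound() is useful here) and insert or update as appropriate. (If the alphabet is small and fixed, as with DNA, you could even generate all length-K substrings ahead of time and pre-populate the vector to avoid the overhead of inserting later.)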
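
The pre-population idea could look something like the sketch below. The helper names make_kmer_table and count_kmer are hypothetical, and the code assumes the alphabet is passed in sorted order without duplicates. The table holds alphabet.size()^K entries, so this is only sensible for small alphabets and small K (256 entries for DNA with K = 4).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: builds every length-k string over `alphabet` in
// lexicographic order, each with a count of 0.
// Assumes `alphabet` is sorted and has no duplicate characters.
std::vector<std::pair<std::string, int>>
make_kmer_table(std::string_view alphabet, std::size_t k) {
  std::vector<std::pair<std::string, int>> table;
  if (alphabet.empty() || k == 0) return table;

  std::vector<std::size_t> digits(k, 0);  // odometer over alphabet positions
  while (true) {
    std::string word(k, alphabet[0]);
    for (std::size_t i = 0; i < k; ++i) word[i] = alphabet[digits[i]];
    table.emplace_back(std::move(word), 0);

    // Advance the odometer, rightmost position first.
    std::size_t pos = k;
    while (pos > 0 && digits[pos - 1] + 1 == alphabet.size()) {
      digits[pos - 1] = 0;
      --pos;
    }
    if (pos == 0) break;  // every position wrapped: all words were emitted
    ++digits[pos - 1];
  }
  return table;
}

// Hypothetical helper: increments the count for `word` via binary search.
// Returns false if `word` is not in the table (for example, it contains a
// character outside the alphabet); nothing is inserted in that case.
bool count_kmer(std::vector<std::pair<std::string, int>>& table,
                std::string_view word) {
  auto pos = std::lower_bound(table.begin(), table.end(), word,
                              [](const auto& entry, std::string_view w) {
                                return entry.first < w;
                              });
  if (pos == table.end() || pos->first != word) return false;
  ++pos->second;
  return true;
}
```

Because the table never grows after construction, each update is a binary search with no element shifting. The trade-off is that substrings containing unexpected characters are rejected rather than counted, so the caller has to decide how to handle them.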

When done, re-sort the vector by count in descending order. std::partial_sort() comes in really handy here, since you only need the first N elements:

std::partial_sort(words.begin(), words.begin() + N, words.end(),
                  [](const auto &a, const auto &b){ return a.second > b.second; });

Basically, something like the following:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

constexpr std::size_t K = 4;
constexpr std::size_t N = 2;

int main() {
  // Kept sorted by substring so that lookups can use binary search.
  std::vector<std::pair<std::string, int>> words;
  std::string line;

  while (std::getline(std::cin, line)) {
    // i + K <= size() avoids the unsigned underflow of size() - K + 1
    // on lines shorter than K.
    for (std::size_t i = 0; i + K <= line.size(); i += 1) {
      auto word = std::string_view(line.data() + i, K);
      auto pos = std::lower_bound(words.begin(), words.end(), word,
                                  [](const auto &a, const auto &b){
                                    return a.first < b;
                                  });
      if (pos == words.end() || pos->first != word) {
        // Not seen before: insert at the position that keeps the vector sorted.
        words.emplace(pos, std::string(word), 1);
      } else {
        pos->second += 1;
      }
    }
  }

  // Only the first N entries need to end up in order of descending count.
  auto sort_to = std::min(words.size(), N);
  std::partial_sort(words.begin(), words.begin() + sort_to, words.end(),
                    [](const auto &a, const auto &b){
                      return a.second > b.second;
                    });
  for (std::size_t i = 0; i < sort_to; i += 1) {
    std::cout << words[i].first << " - " << words[i].second << " occurrences\n";
  }

  return 0;
}
