Question

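I'm looking for a fast, really fast, way to read a text file. I have thought of several solutions but have not been able to arrive at the best one. Let me describe my problem first, and then what I have already tried.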

Problem statement:

Say I have a 10 GB text file with the following format:

___PART_1___
*1 abc
*2 def
...
<5 million lines of this format>
*5000001 blah

___PART_2___
1 *1:1 *2:2 <value1>
2 *3:1 *4:3 <value2>
3 *4:2 *4:4 <value3>
<another 10 million lines of this format>

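In ___PART_1___ there are two columns, ID and NAME. In ___PART_2___ there are four columns: serial number, data1, data2, and some value.

What we want from this big file is the data before the colon in the data1 and data2 columns. In this case, we want the following:

From the first line of ___PART_2___, extract *1 and *2 and get the corresponding names from ___PART_1___, which in this case are abc and def. From the second line of ___PART_2___, extract *3 and *4 and get the corresponding names from ___PART_1___.

That is all the information we want.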
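To make the extraction step concrete, here is a minimal sketch of pulling the IDs out of one ___PART_2___ line and looking up their names. It assumes C++17 (for std::string_view) and single-space separators; the helper names id_before_colon, field, and names_for_line are hypothetical and are not part of my actual code.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Hypothetical helper: returns the part of a token such as "*1:1" that
// precedes the first ':' (the whole token if it contains no ':').
inline std::string_view id_before_colon(std::string_view token) {
    return token.substr(0, token.find(':'));
}

// Hypothetical helper: returns the n-th (0-based) space-separated field of
// `line`, or an empty view if the line has fewer fields. No allocation.
inline std::string_view field(std::string_view line, std::size_t n) {
    std::size_t pos = 0;
    while (true) {
        pos = line.find_first_not_of(' ', pos);
        if (pos == std::string_view::npos) return {};
        const std::size_t end = line.find(' ', pos);
        // substr clamps the count, so this is also correct when end == npos.
        if (n == 0) return line.substr(pos, end - pos);
        if (end == std::string_view::npos) return {};
        pos = end;
        --n;
    }
}

// Looks up the names for the data1 and data2 columns of one PART_2 line.
// An ID that is missing from the map yields an empty string.
inline std::pair<std::string, std::string> names_for_line(
    std::string_view line,
    const std::unordered_map<std::string, std::string>& names) {
    auto lookup = [&](std::size_t column) -> std::string {
        const std::string key(id_before_colon(field(line, column)));
        const auto it = names.find(key);
        return it == names.end() ? std::string() : it->second;
    };
    return {lookup(1), lookup(2)};
}
```

With a map holding {"*1", "abc"} and {"*2", "def"}, calling names_for_line("1 *1:1 *2:2 <value1>", names) yields the pair ("abc", "def"). The lookup still builds a temporary std::string key for each ID, which is one of the costs I would like to avoid.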

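Things to consider before coming to a conclusion:

In ___PART_1___, the IDs may not be unique or consecutive, and the number of lines can be anything; 5 million is just an example figure.

In ___PART_2___, it is guaranteed that the data before the colon in the data1 and data2 columns has an entry in ___PART_1___.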
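Given these constraints, the ___PART_1___ lookup table could be built roughly as sketched below. This assumes C++17 and a policy that the first entry wins when an ID repeats (the problem statement does not say which entry should win); make_name_map, add_part1_line, and the expected_entries size hint are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

using NameMap = std::unordered_map<std::string, std::string>;

// Creates the table with buckets allocated up front. `expected_entries` is
// only a hint (for example 5000000); the map still grows if it is too small.
inline NameMap make_name_map(std::size_t expected_entries) {
    NameMap names;
    names.reserve(expected_entries);  // avoids repeated rehashing while loading
    return names;
}

// Parses one PART_1 line of the form "<id> <name>" and stores it.
// Returns true if a new entry was inserted. Lines without a space (blank
// lines, "...") are skipped, and a repeated ID keeps its first name because
// emplace does not overwrite an existing key.
inline bool add_part1_line(NameMap& names, std::string_view line) {
    const std::size_t space = line.find(' ');
    if (space == std::string_view::npos || space == 0) return false;
    return names
        .emplace(std::string(line.substr(0, space)),
                 std::string(line.substr(space + 1)))
        .second;
}
```

If the last entry should win instead, names[key] = value (or insert_or_assign) would replace the emplace call.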

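Tried so far: Number 1: I tried keeping ___PART_1___ in a std::map, but since the number of entries is huge, the balancing alone takes a lot of time. So I settled on unordered_map, and I will write a good hash function for it as well. Then, whenever I reach ___PART_2___, I tokenize the line, take the second and third tokens, tokenize those again, and get the data. Finally, I look them up in the unordered_map. I used boost::tokenizer to tokenize.

Number 2: Instead of boost::tokenizer, I also tried regex_search, but that seems to be slow as well.

Number 3: I mapped the file into memory with mmap, but since the file is huge, my program sometimes runs out of memory.

Code snapshot, not the complete code: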
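One alternative to mapping the whole file that I am considering is reading it in fixed-size chunks and splitting lines by hand, so that only one chunk (plus at most one partial line) is held in memory at a time. The following is a minimal sketch under that assumption, using C++17; LineSplitter and for_each_line are hypothetical names, and I have not measured whether this is actually faster.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: accepts arbitrary chunks of text and calls `on_line`
// once per complete line, buffering a trailing partial line between chunks.
class LineSplitter {
public:
    template <typename Callback>
    void feed(std::string_view chunk, Callback&& on_line) {
        std::size_t start = 0;
        while (true) {
            const std::size_t nl = chunk.find('\n', start);
            if (nl == std::string_view::npos) break;
            const std::string_view piece = chunk.substr(start, nl - start);
            if (carry_.empty()) {
                on_line(piece);  // the whole line lies inside this chunk
            } else {
                carry_.append(piece);  // finish a line begun in an earlier chunk
                on_line(std::string_view(carry_));
                carry_.clear();
            }
            start = nl + 1;
        }
        carry_.append(chunk.substr(start));  // keep the unfinished tail
    }

    // Emits a final line that is not terminated by '\n', if there is one.
    template <typename Callback>
    void finish(Callback&& on_line) {
        if (!carry_.empty()) {
            on_line(std::string_view(carry_));
            carry_.clear();
        }
    }

private:
    std::string carry_;
};

// Reads `in` in blocks of `chunk_size` bytes and calls `on_line` per line.
template <typename Callback>
void for_each_line(std::istream& in, std::size_t chunk_size, Callback&& on_line) {
    assert(chunk_size > 0);
    std::vector<char> buffer(chunk_size);
    LineSplitter splitter;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        splitter.feed(
            std::string_view(buffer.data(), static_cast<std::size_t>(got)),
            on_line);
    }
    splitter.finish(on_line);
}
```

A call such as for_each_line(myfile, 1 << 20, handler), with the stream opened in binary mode, would then replace the std::getline loop; the views passed to the callback are only valid during the call, so anything that must outlive it has to be copied.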

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>

typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
typedef std::unordered_map<std::string, std::string> m_unordered;
typedef std::unordered_map<std::string, std::string>::iterator m_unordered_itr;

int main() {
  m_unordered un_name_map;
  m_unordered_itr un_name_map_itr;
  boost::char_separator<char> space_sep{" "};
  std::ifstream myfile("file.txt");
  if (myfile.is_open()) {
    std::string line;
    bool part1_starts = false;
    bool part2_starts = false;
    while ( std::getline (myfile,line) ) {
      if (line.find("___PART_1___") != std::string::npos) {
        part1_starts = true;
        continue;
      }
      // Check for the PART_2 header before treating the line as PART_1 data
      if (line.find("___PART_2___") != std::string::npos) {
        part2_starts = true;
        part1_starts = false;
        continue;
      }
      if (part1_starts) {
        tokenizer tok{line, space_sep};
        tokenizer::iterator it = tok.begin();
        if (it == tok.end()) continue;  // skip blank lines
        std::string index = *it++;
        if (it == tok.end()) continue;  // skip lines without a name column
        std::string value = *it;
        un_name_map.insert(un_name_map.end(), {index, value});
      }
      if (part2_starts) {
        tokenizer tok{line, space_sep};
        tokenizer::iterator it_start = tok.begin();
        if (it_start == tok.end()) continue;  // skip blank lines
        // Ignore first token and advance
        std::advance(it_start, 1);
        if (it_start == tok.end()) continue;  // skip lines without a data1 column

        // Split the second token which is my second column of ___PART_2___
        std::vector<std::string> strs;
        strs.reserve(2);
        const std::string data1 = *it_start;
        boost::split(strs, data1, boost::is_any_of(":"));
        un_name_map_itr = un_name_map.find(strs[0]);
        if (un_name_map_itr != un_name_map.end()) {
          std::cout << "1. Name from the map is " << un_name_map_itr->second << std::endl;
        }

        // Split the third token which is my third column of ___PART_2___
        // Similar code as above.
      }
    }
  }
}

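I am sure there are better ways to solve the problem above, and I look forward to all of them. The only thing I care about is speed. I'm happy to provide more details if needed.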

Fast parsing of raw data with a specific format

0 answers: