Question

I am currently working on a small program to join two text files (similar to a database join). One file might look like this:

The second one is similar:


    hsdf87347
    7485C5
    rhdff
    23487
    948FD4

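Both files have more than 1,000,000 lines, and the lines are not limited to a specific number of characters. What I would like to do is find all matching lines in both files.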

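I have tried a few things (arrays, vectors, lists), but I am currently struggling to decide which is the best way, meaning the fastest and the most memory-friendly.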

My code currently looks like this:



    #include <iostream>
    #include <fstream>
    #include <string>
    #include <ctime>
    #include <list>
    #include <algorithm>
    #include <iterator>
    using namespace std;


    int main()
    {

        string line;

        clock_t startTime = clock();

        list<string> data;
        //read first file
        ifstream myfile ("test.txt");
        if (myfile.is_open())
        {
            for(line; getline(myfile, line);/**/){
                data.push_back(line);
            }

            myfile.close();
        }

        list<string> data2;
        //read second file
        ifstream myfile2 ("test2.txt");
        if (myfile2.is_open())
        {
            for(line; getline(myfile2, line);/**/){
                data2.push_back(line);
            }

            myfile2.close();
        }
        else cout << "Unable to open file";

        // compare the lists:
        //if data[j] > data2[k], k++
        //if data[j] < data2[k], j++

        return 0;


    }

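My thoughts are: with a vector, random access to elements is very difficult, and jumping to the next element is not optimal (this is not in the code, but I hope you get the point). Reading the file into a vector by using push_back and adding it line by line also takes a long time. With arrays, random access is easier, but reading more than 1,000,000 records into an array will be very memory-intensive and will take a long time as well. Lists can read the files faster, but random access is expensive again.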

Eventually, I want to look not only for exact matches but also for matches on the first 4 characters of each line.

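Could you help me decide which approach is the most efficient? I have tried arrays, vectors and lists, but so far I am not satisfied with the speed. Is there any other way to find the matches that I have not considered? I am happy to change the code completely and look forward to any suggestions!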

Thanks a lot!

EDIT: The output should list the matching values/lines. In this example, the output should look like this:


    7485C5
    948FD4

Answer 1

Reading 2 million lines won't be too slow; what is probably slowing you down is your comparison logic:

Use std::set_intersection:

    // data1 and data2 are the std::list<std::string> containers from the question
    data1.sort(); // N1 log(N1); std::list provides its own member sort
    data2.sort(); // N2 log(N2)

    std::vector<std::string> v; // receives the matching elements

    std::set_intersection(data1.begin(), data1.end(),
                          data2.begin(), data2.end(),
                          std::back_inserter(v));

    // Does at most 2(N1+N2)-1 comparisons (worst case)

You could also try using a std::set and inserting the lines from both files; the resulting set will contain only unique elements.
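
For reference, the sort-and-intersect idea can be written as a self-contained sketch. It assumes the lines are held in std::vector<std::string> rather than std::list (so std::sort can be used); the helper names read_lines and matching_lines are made up for this illustration.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads every line of the file at `path` into a vector.
// Returns an empty vector if the file cannot be opened.
std::vector<std::string> read_lines(const std::string& path)
{
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Hypothetical helper: returns the lines present in both inputs, in sorted order.
// The vectors are taken by value, so the caller's data is left untouched.
std::vector<std::string> matching_lines(std::vector<std::string> a,
                                        std::vector<std::string> b)
{
    std::sort(a.begin(), a.end());   // O(N1 log N1)
    std::sort(b.begin(), b.end());   // O(N2 log N2)

    std::vector<std::string> matches;
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(matches));
    return matches;
}
```

With the file names from the question, a call would look like matching_lines(read_lines("test.txt"), read_lines("test2.txt")). Note that if a line occurs several times in both files, std::set_intersection emits it as many times as it occurs in the file where it is less frequent.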

Answer 2

One solution is to read each entire file at once.

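Use istream::seekg and istream::tellg to determine the size of the two files. Allocate a character array large enough to store them both. Then use istream::read to read both files into the array, each at its appropriate location.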

Here is an example of the above functions.
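
A minimal sketch of the seekg/tellg/read approach, which is not the example the answer linked to. It makes two assumptions that differ from the answer's wording: it reads one file per call, and it stores the bytes in a std::string instead of a raw character array. The function name read_whole_file is hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads the entire file at `path` into one string.
// Returns an empty string if the file cannot be opened or is empty.
std::string read_whole_file(const std::string& path)
{
    // Binary mode so the size reported by tellg matches the bytes read.
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return std::string();

    in.seekg(0, std::ios::end);              // jump to the end of the file
    const std::streamoff size = in.tellg();  // position at the end == file size
    in.seekg(0, std::ios::beg);              // rewind before reading
    if (size <= 0)
        return std::string();

    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.read(&buffer[0], size);               // one bulk read instead of per-line reads
    buffer.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
    return buffer;
}
```

The returned buffer still has to be split at '\n' to recover the individual lines before they can be compared.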

Answer 3

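If the values in the first file are unique, this becomes trivial when you exploit the O(n log n) characteristics of a set. The following code stores all lines of the first file (passed as a command-line argument) in a set, then performs an O(log n) search for each line of the second file.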

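EDIT: Added searching on a 4-character prefix only. To do this, the set contains only the first four characters of each line, and the search over the second file looks up only the first four characters of each line. If there is a match, the second file's line is printed in full. Printing the first file's full line as well would be a bit more challenging.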

    #include <cstdlib>   // EXIT_FAILURE
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    int main(int argc, char *argv[])
    {
        if (argc < 3)
            return EXIT_FAILURE;

        // load the set with the first four characters of each line of the first file
        std::ifstream inf(argv[1]);
        std::set<std::string> lines;
        std::string line;
        while (std::getline(inf, line))
            lines.insert(line.substr(0,4));

        // scan the second file, printing every line whose first four characters are in the set
        std::ifstream inf2(argv[2]);
        while (std::getline(inf2, line))
        {
            if (lines.find(line.substr(0,4)) != lines.end())
                std::cout << line << '\n';
        }

        return 0;
    }

C++ read file into Array / List / Vector