Question

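I want to remove duplicate lines from a large file, using a std::unordered_map to record whether each line has been seen before.

To cope with the file being too large, I want the key in the map to be a std::string that is not held in memory. The value actually stored would be the line's position in the file, and the comparator would read the line at that position and compare it with the current key.

For example, if the string is "abcd", the key is "abcd", but once it has been determined that it is not already in the map, it is stored as, say, 36, where 36 is the starting position of "abcd" in the file.

Is there a way to do this with the built-in std::unordered_map (or another hash map data structure) without implementing my own?

Also, if not, what is the best way to implement it myself? I was thinking of using std::unordered_map<size_t, vector<int>>, where the size_t key is the std::hash of my string and the vector stores positions in the file, so that I can read the line at each position and compare. Is there a better way?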
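
For reference, the hash-to-positions idea from the last paragraph could look roughly like the sketch below. It is only an illustration under some assumptions: the names read_line_at and dedup_lines are made up for this example, positions are stored as std::streampos rather than int, and the input is accessed through two separate streams (one scanned sequentially, one used only for seeking back), which for a real file would mean opening it twice.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>   // only needed for in-memory experiments with stringstreams
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: reads the line that starts at offset `pos`.
// `lookup` is a stream reserved for random access, so no position
// needs to be saved and restored afterwards.
std::string read_line_at(std::istream& lookup, std::streampos pos) {
    lookup.clear();            // drop eofbit/failbit left by an earlier read
    lookup.seekg(pos);
    std::string text;
    std::getline(lookup, text);
    return text;
}

// Hypothetical function: copies each distinct line of `scan` to `out` once,
// in first-seen order, and returns how many lines were written.
// The index holds only hashes and offsets, never the line text.
std::size_t dedup_lines(std::istream& scan, std::istream& lookup, std::ostream& out) {
    std::unordered_map<std::size_t, std::vector<std::streampos>> seen;
    std::hash<std::string> hasher;
    std::size_t written = 0;
    std::string line;
    std::streampos pos = scan.tellg();         // start of the line about to be read
    while (std::getline(scan, line)) {
        std::vector<std::streampos>& bucket = seen[hasher(line)];
        bool duplicate = false;
        for (std::streampos candidate : bucket) {
            // Equal hashes do not prove equal text, so compare the real lines.
            if (read_line_at(lookup, candidate) == line) {
                duplicate = true;
                break;
            }
        }
        if (!duplicate) {
            bucket.push_back(pos);
            out << line << '\n';
            ++written;
        }
        pos = scan.tellg();
    }
    return written;
}
```

With two std::istringstream objects built from the same text and a std::ostringstream as output, dedup_lines can be tried without touching the disk. The file is only re-read when two lines share a hash, which is the main appeal of this layout.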

Answer 1

Suppose you have a class named Stuff whose objects store only a size_t but can find the actual line of text (as you describe):

struct Stuff // the naming here is arbitrary and the code illustrative
{
    // shared by all Stuff objects; knows how to read the real data
    static WhateverYouNeedToReadRealData same_to_all_stuff;
    size_t pos; // where the line starts in the file
    std::string getText() const
    {
        return same_to_all_stuff.read_line_somehow_for(pos);
    }
};

Then you write a custom hasher:

struct HashStuff
{
    size_t operator()(Stuff const& stuff) const
    {
        return std::hash<std::string>()(stuff.getText());
    }
};

Then you write a custom comparator:

struct CompareStuff
{
    bool operator()(Stuff const& left, Stuff const& right) const
    {
        return left.getText() == right.getText();
    }
};

Then you can set up your Stuff and instantiate your unordered_set:

Stuff::same_to_all_stuff = yourSpecialCase(); 
std::unordered_set<Stuff,HashStuff,CompareStuff> stuffSet;

So, Q.E.D.: using a custom comparator and hasher is trivial.
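
To make the outline above concrete, here is a small self-contained sketch. It is not the answerer's code: LineSource is a hypothetical stand-in for WhateverYouNeedToReadRealData, each Stuff carries a pointer to its source instead of relying on a static member, and the position is a std::streampos rather than a size_t.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>   // only needed for in-memory experiments with stringstreams
#include <string>
#include <unordered_set>

// Hypothetical reader: fetches the line that starts at a given offset.
class LineSource {
public:
    explicit LineSource(std::istream& in) : in_(in) {}

    std::string read_line_at(std::streampos pos) const {
        in_.clear();             // drop eofbit/failbit left by an earlier read
        in_.seekg(pos);
        std::string text;
        std::getline(in_, text);
        return text;
    }

private:
    std::istream& in_;
};

// Stores only where the line starts; the text is read on demand.
struct Stuff {
    const LineSource* source;
    std::streampos pos;

    std::string getText() const { return source->read_line_at(pos); }
};

// Hashes the text the object refers to, not the offset itself.
struct HashStuff {
    std::size_t operator()(Stuff const& stuff) const {
        return std::hash<std::string>()(stuff.getText());
    }
};

// Two objects are equal when the lines they point at are equal.
struct CompareStuff {
    bool operator()(Stuff const& left, Stuff const& right) const {
        return left.getText() == right.getText();
    }
};

using StuffSet = std::unordered_set<Stuff, HashStuff, CompareStuff>;
```

Inserting a Stuff for every line start into a StuffSet keeps one entry per distinct line: insert(...).second is false whenever the text at the new offset has already been seen at another offset.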

Answer 2

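I am posting my solution here in case it helps anyone. It is based on the idea Oo Tiib gave in the answer above.

First come the two classes. Line represents a line of the file: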

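#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>
using namespace std;

// A line identified by its starting offset; the text is re-read on demand.
class Line {
    streampos pos_;
    ifstream &file_;
    mutable streampos tpos_;
    mutable ios_base::iostate state_;

    // Remember the stream state and read position, then seek to pos.
    void SavePos(streampos pos) const {
        state_ = file_.rdstate();
        file_.clear();              // tellg() fails unless the stream is good
        tpos_ = file_.tellg();
        file_.seekg(pos);
    }

    // Go back to the remembered position, then restore the remembered state.
    void RestorePos() const {
        file_.clear();              // seek first: seekg() does nothing on a failed stream
        file_.seekg(tpos_);
        file_.setstate(state_);
    }
public:
    Line(ifstream &f, streampos pos): pos_(pos), file_(f) { }

    // Reads the line at pos_ without disturbing the caller's read position.
    string GetText() const {
        string line;
        SavePos(pos_);
        getline(file_, line);
        RestorePos();
        return line;
    }

    bool operator==(const Line& other) const {
        return (this->GetText() == other.GetText());
    }
};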

Then HashLine, a functor that reads the line and hashes it as a string:

class HashLine {
public:
    // Hash the text of the line, not its position in the file.
    size_t operator() (const Line& l) const {
        return std::hash<string>()(l.GetText());
    }
};

Finally, the rm_dups function, which creates the hash table and uses the classes above to remove duplicate lines:

int rm_dups(const string &in_file, const string &out_file) {
    string line;
    unordered_set<Line, HashLine> lines;
    ifstream file(in_file);
    ofstream out(out_file);
    if (!file || !out) {
        return -1;  // could not open the input or the output file
    }
    streampos pos = file.tellg();  // offset at which the next line starts
    while (getline(file, line)) {
        Line l(file, pos);
        if (lines.find(l) == lines.end()) {
            // does not exist so far, add this new line
            out << line << '\n';  // same text that l.GetText() would re-read
            lines.insert(l);
        }
        pos = file.tellg();
    }
    return 0;
}
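
One possible refinement, which is not part of the solution above: HashLine re-reads a line from the file every time the table needs its hash. Since the text is already in memory when a line is first scanned, the hash can be computed once and stored next to the offset, so the file is only re-read to confirm equality when two hashes match. The sketch below assumes this variation; the names CachedLine, CachedLineHash, CachedLineEqual and rm_dups_cached are made up for the example, and it takes two streams over the same data (one for scanning, one for lookups) instead of saving and restoring a single stream's position.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>   // only needed for in-memory experiments with stringstreams
#include <string>
#include <unordered_set>

// Hypothetical key type: offset of the line plus its precomputed hash.
struct CachedLine {
    std::istream* lookup;   // stream used only for random-access reads
    std::streampos pos;     // where the line starts
    std::size_t hash;       // hash of the line's text, computed once

    std::string text() const {
        lookup->clear();    // drop eofbit/failbit left by an earlier read
        lookup->seekg(pos);
        std::string s;
        std::getline(*lookup, s);
        return s;
    }
};

struct CachedLineHash {
    // No file access here: the stored hash is returned as-is.
    std::size_t operator()(const CachedLine& l) const { return l.hash; }
};

struct CachedLineEqual {
    // The file is only consulted when the cheap hash comparison passes.
    bool operator()(const CachedLine& a, const CachedLine& b) const {
        return a.hash == b.hash && a.text() == b.text();
    }
};

// Writes each distinct line of `scan` to `out` once and returns the count.
std::size_t rm_dups_cached(std::istream& scan, std::istream& lookup, std::ostream& out) {
    std::unordered_set<CachedLine, CachedLineHash, CachedLineEqual> lines;
    std::hash<std::string> hasher;
    std::string line;
    std::size_t written = 0;
    std::streampos pos = scan.tellg();   // start of the line about to be read
    while (std::getline(scan, line)) {
        CachedLine entry{&lookup, pos, hasher(line)};
        if (lines.insert(entry).second) {   // true only for a line not seen before
            out << line << '\n';
            ++written;
        }
        pos = scan.tellg();
    }
    return written;
}
```

For a real file, scan and lookup would be two std::ifstream objects opened on the same path. The trade-off is one extra size_t per distinct line in exchange for fewer seeks.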

Remote key in C++ unordered_map
