Question

I'm trying to find duplicate instances of strings in a vector of roughly 2.5 million strings.

At the moment I'm using:

std::vector<string> concatVec; // Holds all of the concatenated strings containing columns C,D,E,J and U.
std::vector<string> dupecheckVec; // Holds all of the unique instances of concatenated columns
std::vector<unsigned int> linenoVec; // Holds the line numbers of the unique instances only

// Copy first element across, it cannot be a duplicate yet
dupecheckVec.push_back(concatVec[0]);
linenoVec.push_back(0);

// Copy across and do the dupecheck
for (unsigned int i = 1; i < concatVec.size(); i++)
{
    bool exists = false;

    for (unsigned int x = 0; x < dupecheckVec.size(); x++)
    {
        if (concatVec[i] == dupecheckVec[x])
        {
            exists = true;
        }
    }

    if (exists == false)
    {
        dupecheckVec.push_back(concatVec[i]);
        linenoVec.push_back(i);
    }
    else
    {
        exists = false;
    }
}

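This works fine for small files, but it obviously takes an extremely long time as the file size grows, because of the nested for loop and the increasing number of strings held in dupecheckVec.

What would be a less horrific way to do this on a large file?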

Answer 1

If you don't mind reordering the vector, then this should do it in O(n*log(n)) time:

std::sort(vector.begin(), vector.end());
vector.erase(std::unique(vector.begin(), vector.end()), vector.end());

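To preserve the order, you could instead use a vector of (line number, string*) pairs: sort by string, uniquify using a comparator that compares the string contents, and finally sort by line number, along the lines of: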

struct pair {int line; std::string const * string;};

struct OrderByLine {
    bool operator()(pair const & x, pair const & y) {
        return x.line < y.line;
    }
};

struct OrderByString {
    bool operator()(pair const & x, pair const & y) {
        return *x.string < *y.string;
    }
};

struct StringEquals {
    bool operator()(pair const & x, pair const & y) {
        return *x.string == *y.string;
    }
};

std::sort(vector.begin(), vector.end(), OrderByString());
vector.erase(std::unique(vector.begin(), vector.end(), StringEquals()), vector.end());
std::sort(vector.begin(), vector.end(), OrderByLine());
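
For reference, here is a minimal self-contained sketch of the same idea. The names `Entry` and `firstOccurrences` are made up for illustration, and it assumes C++11 lambdas are available. One deliberate difference from the snippet above: the first sort breaks ties by line number. `std::sort` is not stable, so without the tie-break the surviving entry of each duplicate group is not guaranteed to be the earliest line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: a line number plus a pointer into the input vector.
struct Entry {
    std::size_t line;
    const std::string* text;
};

// Returns the line numbers of the first occurrence of each distinct string,
// in ascending line order. The input vector itself is not modified.
std::vector<std::size_t> firstOccurrences(const std::vector<std::string>& input) {
    std::vector<Entry> entries;
    entries.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        entries.push_back(Entry{i, &input[i]});
    }

    // Sort by string content; ties are broken by line number so the earliest
    // occurrence comes first within each run of equal strings.
    std::sort(entries.begin(), entries.end(), [](const Entry& a, const Entry& b) {
        const int c = a.text->compare(*b.text);
        if (c != 0) return c < 0;
        return a.line < b.line;
    });

    // std::unique keeps the first element of each run of equal strings.
    entries.erase(std::unique(entries.begin(), entries.end(),
                              [](const Entry& a, const Entry& b) {
                                  return *a.text == *b.text;
                              }),
                  entries.end());

    // Restore the original line order.
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.line < b.line; });

    std::vector<std::size_t> lines;
    lines.reserve(entries.size());
    for (const Entry& e : entries) {
        lines.push_back(e.line);
    }
    assert(std::is_sorted(lines.begin(), lines.end()));
    return lines;
}
```

Calling `firstOccurrences(concatVec)` would give the equivalent of `linenoVec` from the question; the unique strings can then be read back as `concatVec[line]`.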

Answer 2

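You could sort, which is O(n log n); after that, any equal elements must be consecutive, so you only need to compare each element with the next one, which is just O(n). Your naive solution, by contrast, is O(n^2).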
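
A rough sketch of that sort-then-scan approach, assuming you only need to know which strings are duplicated and don't care about their original positions. The function name `findDuplicates` is just illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns every string that occurs more than once, listed once each, in
// sorted order. The argument is taken by value so the caller's vector keeps
// its original order; only the local copy is sorted.
std::vector<std::string> findDuplicates(std::vector<std::string> values) {
    std::sort(values.begin(), values.end());  // O(n log n)

    std::vector<std::string> duplicates;
    for (std::size_t i = 1; i < values.size(); ++i) {  // O(n) scan
        const bool sameAsPrevious = (values[i] == values[i - 1]);
        const bool alreadyReported =
            !duplicates.empty() && duplicates.back() == values[i];
        if (sameAsPrevious && !alreadyReported) {
            duplicates.push_back(values[i]);
        }
    }
    assert(std::is_sorted(duplicates.begin(), duplicates.end()));
    return duplicates;
}
```

Copying 2.5 million strings costs memory; if reordering the original is acceptable, sorting it in place avoids the copy.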

Answer 3

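You could use a hash table that uses the strings as keys and integers as values (the counts). Then iterate over the list of strings and increment each string's value by 1. Finally, iterate over the hash table and keep the strings whose count is 1.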
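
As a sketch of the counting idea, assuming C++11's `std::unordered_map` as the hash table (the function name `stringsOccurringOnce` is made up). It differs from the description in one detail: the second pass walks the input instead of the hash table, so the result stays in the original order, since iteration order of an `unordered_map` is unspecified.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the strings that occur exactly once, in their original order.
std::vector<std::string> stringsOccurringOnce(const std::vector<std::string>& input) {
    std::unordered_map<std::string, int> counts;
    counts.reserve(input.size());  // avoid repeated rehashing on large inputs

    // First pass: count occurrences (operator[] default-initializes to 0).
    for (const std::string& s : input) {
        ++counts[s];
    }

    // Second pass: keep only the strings whose count is 1.
    std::vector<std::string> once;
    for (const std::string& s : input) {
        if (counts.at(s) == 1) {
            once.push_back(s);
        }
    }
    assert(once.size() <= input.size());
    return once;
}
```

Average cost is O(n) hash operations, at the price of storing a copy of every distinct string as a key.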

[UPDATE] Another solution:

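Use a hash table with the string as key and its position in the vector/array as value
For each string in the vector:
- If the string is already contained in the hash table, [optional: remove the entry and] continue
- Otherwise, put the index position of the current string into the hash table, using the string as key, and continue
When done, iterate over the hash table and use the indices to retrieve the unique strings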

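This solution gives you the indices of all strings, with duplicates filtered out. If you only want the strings that have no duplicates at all, you must remove the hash table entry whenever the string is already present in the hash table.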
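
The non-optional variant of the steps above (keep the first occurrence of each string) could look something like the following. This is only a sketch: `UniqueResult` and `collectUnique` are hypothetical names, and `std::unordered_map` is assumed as the hash table. Instead of iterating over the hash table at the end, it records the indices as it goes, which keeps them in the original order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type mirroring dupecheckVec / linenoVec from the question.
struct UniqueResult {
    std::vector<std::string> unique;   // first occurrence of each distinct string
    std::vector<std::size_t> lineno;   // index of that first occurrence
};

UniqueResult collectUnique(const std::vector<std::string>& concatVec) {
    UniqueResult result;
    std::unordered_map<std::string, std::size_t> firstIndex;
    firstIndex.reserve(concatVec.size());

    for (std::size_t i = 0; i < concatVec.size(); ++i) {
        // emplace() returns a pair; .second is true only when the key was
        // not already present, i.e. this is the first time we see the string.
        if (firstIndex.emplace(concatVec[i], i).second) {
            result.unique.push_back(concatVec[i]);
            result.lineno.push_back(i);
        }
    }
    assert(result.unique.size() == result.lineno.size());
    return result;
}
```

Each string is hashed once, so the whole pass is O(n) on average instead of the O(n^2) nested loop.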

Answer 4

Use std::unique, see this

Checking for duplicates in a large vector of strings
