Question

Here is the code for reference:

struct MyData
{
int ID;

// other members
};

std::vector<MyData> inputData;

std::vector<std::vector<MyData> > outputData = GroupByIDs(inputData);

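Basically, I want to iterate over the input data and group the objects by ID into new mini-vectors, which I then push into the output vector. In the end I get a vector of sub-vectors, where each sub-vector contains the objects that share the same ID.

Is there a cookie-cutter, maximally efficient algorithm designed for this purpose? I can only think of high-complexity algorithms.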

Answer 1

You can do this by sorting the elements by ID and then using std::upper_bound to find the end position of each group:

For example:

#include <string>
#include <vector>
#include <iostream>
#include <algorithm>

struct MyData
{
    int id;
    std::string info;

    MyData(int id, const std::string& info): id(id), info(info) {}

    // for sorting by id
    bool operator<(const MyData& d) const { return id < d.id; }
};

// function requires sorted data as input
std::vector<std::vector<MyData> > GroupByIDs(const std::vector<MyData>& data)
{
    std::vector<std::vector<MyData> > groups;

    decltype(data.end()) upper;

    for(auto lower = data.begin(); lower != data.end(); lower = upper)
    {
        // get the upper position of all elements with the same ID
        upper = std::upper_bound(lower, data.end(), *lower);

        // add those elements as a group to the output vector
        groups.emplace_back(lower, upper);
    }

    return groups;
}

int main()
{
    std::vector<MyData> data {{2, "A"}, {4, "B"}, {3, "C"}, {4, "D"}, {9, "E"}, {3, "F"}};

    // function requires sorted data
    std::sort(data.begin(), data.end());
    std::vector<std::vector<MyData> > groups = GroupByIDs(data);

    for(auto const& group: groups)
    {
        if(!group.empty())
            std::cout << "group: " << group.front().id << '\n';

        for(auto const& d: group)
            std::cout << "     : " << d.info << '\n';

        std::cout << '\n';
    }
}

Output:

group: 2
     : A

group: 3
     : C
     : F

group: 4
     : B
     : D

group: 9
     : E

Answer 2

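Your question is a bit vague, but I think you could use std::sort or one of its friends (std::stable_sort, std::partition, std::stable_partition), and then use std::copy to copy from an iterator range of one vector into another vector.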
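
One possible reading of this suggestion, sketched under assumptions: repeatedly call std::stable_partition to pull every element with the current ID to the front of the remaining range, then std::copy that run into a new sub-vector. The function name GroupByPartition is hypothetical, and MyData is reduced to its ID member. Each pass scans the remaining range, so the cost grows with the number of distinct IDs; this is not necessarily what the answer had in mind.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

struct MyData
{
    int ID;
    // other members omitted in this sketch
};

// Hypothetical helper: groups by ID with repeated stable partitioning.
// Takes the input by value because partitioning reorders the elements.
std::vector<std::vector<MyData> > GroupByPartition(std::vector<MyData> data)
{
    std::vector<std::vector<MyData> > groups;

    auto first = data.begin();
    while (first != data.end())
    {
        const int id = first->ID;

        // move every element with this ID to the front of the remaining
        // range, keeping the relative order of the elements
        auto last = std::stable_partition(first, data.end(),
            [id](const MyData& d) { return d.ID == id; });

        // copy the run [first, last) into a new sub-vector
        groups.emplace_back();
        std::copy(first, last, std::back_inserter(groups.back()));

        first = last;
    }

    return groups;
}
```

With this approach the groups come out in the order in which each ID first appears in the input.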

Answer 3

If you use a hash table, you can do the following:

Create a hash table that maps from ID to vector<MyData>
Iterate through the input data:
    If the hash table doesn't contain a vector for that ID:
        Create a vector<MyData> and add it to the hash table
    Push the input item into that vector<MyData>
Iterate through the entry set for the hash table:
    Put the vector<MyData> into the vector<vector<MyData>>
Return the vector<vector<MyData>>

This should be O(n) in the average case. If the hash function is bad, I think the worst case could be O(n^2).
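
The steps above could be sketched in C++ roughly as follows, assuming std::unordered_map as the hash table. The function name GroupByHash is hypothetical and MyData is reduced to its ID member. Note that the order of the groups in the result is unspecified, because an unordered_map does not iterate in key order.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

struct MyData
{
    int ID;
    // other members omitted in this sketch
};

// Hypothetical helper: groups by ID through a hash table.
std::vector<std::vector<MyData> > GroupByHash(const std::vector<MyData>& inputData)
{
    // hash table that maps from ID to the vector of items with that ID
    std::unordered_map<int, std::vector<MyData> > table;

    // operator[] creates an empty vector the first time an ID is seen
    for (const MyData& item : inputData)
        table[item.ID].push_back(item);

    // move every per-ID vector into the output vector
    std::vector<std::vector<MyData> > groups;
    groups.reserve(table.size());
    for (auto& entry : table)
        groups.push_back(std::move(entry.second));

    return groups;
}
```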

Answer 4

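You can keep a map that maps each ID to a vector containing the items with that ID. Then iterate over the input and look up each element's ID in the map; if the ID is new, create a new vector for it. The complexity will be O(N log M), where N is the input size and M is the number of possible IDs.

Pseudocode: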

    for(Item in inputData)
            if(Item.ID in IDMap)
                    IDVec = IDMap[Item.ID]
                    IDVec.push(Item)
            else
                    IDVec = new Vector
                    IDVec.push(Item)
                    IDMap.push(IDVec, Item.ID)
                    OutputVec.push(IDVec)
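
A minimal C++ sketch of this idea, under assumptions: since a C++ vector stores its sub-vectors by value, the map here stores the index of each ID's group in the output vector rather than the vector itself. The names GroupByIndexMap and indexOfID are hypothetical, and MyData is reduced to its ID member.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct MyData
{
    int ID;
    // other members omitted in this sketch
};

// Hypothetical helper: groups by ID using an ordered map from ID to group index.
std::vector<std::vector<MyData> > GroupByIndexMap(const std::vector<MyData>& inputData)
{
    std::map<int, std::size_t> indexOfID; // ID -> position of its group in output
    std::vector<std::vector<MyData> > output;

    for (const MyData& item : inputData)
    {
        auto found = indexOfID.find(item.ID);
        if (found == indexOfID.end())
        {
            // new ID: register it and start an empty group at the end
            found = indexOfID.emplace(item.ID, output.size()).first;
            output.emplace_back();
        }

        // append the item to the group that belongs to its ID
        output[found->second].push_back(item);
    }

    return output;
}
```

Each lookup costs O(log M), and the groups appear in the order in which their IDs are first seen in the input.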

Answer 5

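Option 1:
Sort the input by ID, then iterate over the sorted input, accumulating each run of data with the same ID, and copy each run to the destination using vector's vector(iter, iter) constructor.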

The sort needs a comparator:

bool less_than_by_ID(MyData const & a, MyData const & b) 
{ return a.ID < b.ID; }
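
Putting the comparator to work, Option 1 might look like the sketch below. The function name GroupBySortedRuns is hypothetical, MyData is reduced to its ID member, and std::find_if is just one way to locate the end of each run.

```cpp
#include <algorithm>
#include <vector>

struct MyData
{
    int ID;
    // other members omitted in this sketch
};

bool less_than_by_ID(MyData const & a, MyData const & b)
{ return a.ID < b.ID; }

// Hypothetical helper: sorts a copy of the input, then cuts it into runs.
std::vector<std::vector<MyData> > GroupBySortedRuns(std::vector<MyData> data)
{
    std::sort(data.begin(), data.end(), less_than_by_ID);

    std::vector<std::vector<MyData> > groups;

    auto first = data.begin();
    while (first != data.end())
    {
        const int id = first->ID;

        // the run ends at the first element with a different ID
        auto last = std::find_if(first, data.end(),
            [id](MyData const & d) { return d.ID != id; });

        // build the sub-vector with the vector(iter, iter) constructor
        groups.emplace_back(first, last);

        first = last;
    }

    return groups;
}
```

The groups come out in ascending ID order. std::sort is not stable, so std::stable_sort would be the choice if the original order inside each group matters.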

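Option 2:
Keep the items in a multiset or multimap keyed by ID, and use the container's lower_bound and upper_bound to get the range for each key. (Algorithmically, this is the same.)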
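
A rough sketch of Option 2 with std::multimap, under assumptions: the function name GroupByMultimap is hypothetical, MyData is reduced to its ID member, and only upper_bound is needed because each range starts where the previous one ended.

```cpp
#include <map>
#include <vector>

struct MyData
{
    int ID;
    // other members omitted in this sketch
};

// Hypothetical helper: groups by ID through a multimap keyed by ID.
std::vector<std::vector<MyData> > GroupByMultimap(const std::vector<MyData>& inputData)
{
    std::multimap<int, MyData> byID;
    for (const MyData& item : inputData)
        byID.emplace(item.ID, item);

    std::vector<std::vector<MyData> > groups;

    for (auto lower = byID.begin(); lower != byID.end(); )
    {
        // one past the last entry that has the same key as *lower
        auto upper = byID.upper_bound(lower->first);

        // copy the mapped values of this key range into a new group
        groups.emplace_back();
        for (auto it = lower; it != upper; ++it)
            groups.back().push_back(it->second);

        lower = upper;
    }

    return groups;
}
```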

Option 3:
Change the output data structure to

std::map<int, std::vector<MyData> > outputData;

and then just throw the data into the container:

for(auto const& data : inputData)
  outputData[data.ID].push_back(data);

Answer 6

I edited the question; by similarity I mean the same ID

Implementation:

auto MapByIDs(std::vector<MyData> inputData)
{
    std::map<int, std::vector<MyData>> result;
    for(auto &x: inputData)
        result[x.ID].emplace_back(std::move(x));
    return result;
}

auto GroupByIDs(std::vector<MyData> inputData)
{
    auto map = MapByIDs(std::move(inputData));
    std::vector<std::vector<MyData>> result;
    for(auto &x: map)
        result.emplace_back(std::move(x.second));
    return result;
}

auto outputData = GroupByIDs(std::move(inputData));

Best algorithm for grouping data by similarity (same ID)

6 answers: