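Finding the connected components of a large graph with Boost

Time: 2014-07-26 15:01:58

Tags: c++ boost graph bigdata

I wrote code to find the connected components of a very large graph (80 million edges), but it does not work: it crashes when the number of edges approaches 40 million.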

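#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(){
    using namespace boost;
    int node1, node2;
    typedef adjacency_list<vecS, vecS, undirectedS> Graph;
    Graph G;

    // read one edge (two vertex ids) per line
    std::ifstream infile("pairs.txt");
    std::string line;
    while (std::getline(infile, line)) {
        std::istringstream iss(line);
        if (iss >> node1 >> node2)   // skip lines that do not parse
            add_edge(node1, node2, G);
    }

    std::cout << "writing file" << std::endl;
    std::ofstream out;
    out.open("connected_component.txt");
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    // one line per vertex: vertex id, component id
    std::vector<int>::size_type i;
    for (i = 0; i != component.size(); ++i) {
        out << i << "\t " << component[i] << std::endl;
    }
    out.close();
}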

How can I do this with Boost? Or should I change my graph data type?

1 Answer:

Answer 0 (score: 1)

Using random graph data, I can process 40 million edges in about 37s, peaking at 4.4GiB of memory according to Massif:

/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test

Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms

real    0m37.339s
user    0m36.078s
sys 0m1.202s

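1. Memory allocation

Note, however, that I tweaked the graph type to use a vector for the edge list (the last template argument, vecS), so that I can reserve the required capacity up front: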

typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, 
         no_property, vecS> Graph;

  • improves load performance by avoiding reallocations
  • reduces heap fragmentation
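The effect of reserving can be seen in isolation with a plain `std::vector`. The following is only a minimal sketch, not part of the benchmark above: the helper name `count_reallocations` is hypothetical, and it counts capacity changes as a stand-in for reallocations.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: appends n dummy edges and counts how often the
// vector's capacity changes, i.e. how often its storage was reallocated.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<std::pair<int, int>> edges;
    if (reserve_first)
        edges.reserve(n);  // single allocation up front

    std::size_t reallocations = 0;
    std::size_t last_capacity = edges.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        edges.emplace_back(static_cast<int>(i), static_cast<int>(i + 1));
        if (edges.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = edges.capacity();
        }
    }
    return reallocations;
}
```

Calling `count_reallocations(n, true)` reports no capacity changes, while `count_reallocations(n, false)` reports several for any large `n`; each of those is a fresh allocation plus a copy or move of every element stored so far.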

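2. Vertex id scaling

An important thing to note is that the storage requirements scale with the number of vertices. More specifically, they scale with the domain of the vertex ids. For example, loading a file like this: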

1 7
2 7
5 6
4 9

has far lower memory requirements than loading a file like this:

1 70000
2 70000
5 60000
4 90000

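In fact, re-running the above benchmark with exactly the same input, but with the first line changed from

47662 60203

to

476624766 602036020

gives the following result: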

Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @  0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms

real    2m20.318s
user    1m30.224s
sys 0m49.821s

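As you can see, Google's malloc implementation (from gperftools) even warns about the unusually large allocations, and the program does run much slower. The first large allocation, about 14.4GB, works out to roughly 24 bytes for each of the ~602 million vertex slots that the largest id forces the vecS vertex list to create. (Oh, and memory usage grew to the point where Massif no longer coped, but I saw it peak at 23GiB in htop.)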
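One way to sidestep this cost, which the benchmark above does not do, is to remap the ids to a dense range before building the graph, so that the number of vertex slots equals the number of distinct ids. The following is a minimal sketch under that assumption; `DenseEdges` and `remap_to_dense` are hypothetical names, not Boost APIs.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical result type: edges rewritten to dense indices, plus the
// table needed to translate a dense index back to the original id.
struct DenseEdges {
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    std::vector<long long> original_id;  // dense index -> original id
};

// Hypothetical helper: assigns indices 0..k-1 to the k distinct ids in
// order of first appearance and rewrites every edge accordingly.
DenseEdges remap_to_dense(const std::vector<std::pair<long long, long long>>& raw) {
    DenseEdges result;
    result.edges.reserve(raw.size());
    std::unordered_map<long long, std::size_t> index_of;

    // Returns the dense index for id, creating a new one on first sight.
    auto dense = [&](long long id) {
        auto inserted = index_of.emplace(id, result.original_id.size());
        if (inserted.second)
            result.original_id.push_back(id);
        return inserted.first->second;
    };

    for (const auto& e : raw) {
        std::size_t u = dense(e.first);
        std::size_t v = dense(e.second);
        result.edges.emplace_back(u, v);
    }
    return result;
}
```

The remapped pairs can then be passed to `add_edge` as before, and `original_id[i]` recovers the id from the input file when writing the component of vertex `i`. The trade-off is the hash map itself, which costs memory and time proportional to the number of distinct ids.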

Full code

See it Live On Coliru, running on 4000 edges:

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>
#include <utility>
#include <vector>

#include <chrono>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);

    std::vector<std::pair<int, int> > as_read;

    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);

    std::cout << "Reading " << as_read.size() << " done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

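    // build graph
    // m_edges is an internal member of adjacency_list; with vecS as the
    // edge list selector it is a std::vector, so capacity can be reserved
    G.m_edges.reserve(as_read.size());
    for(auto& pair : as_read)
        add_edge(pair.first,pair.second,G);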

    std::cout << "Building graph done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    std::cout << "Algorithm done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output
    std::cout <<"writing file"<<std::endl;

    std::ofstream out;
    out.open("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i) {
        out << i << "\t "<< component[i] << std::endl; 
    }

    out.close();
    std::cout << "Writing done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}
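If only the component labels are needed, storing the graph is not strictly necessary. The sketch below swaps in a different technique from the one used above: a plain union-find (disjoint-set) structure written with the standard library only. The names `DisjointSets` and `label_components` are hypothetical, and the component numbers it produces are not guaranteed to match those from `connected_components`.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Minimal union-find with path halving and union by size.
class DisjointSets {
public:
    explicit DisjointSets(std::size_t n) : parent_(n), size_(n, 1) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];  // path halving
            x = parent_[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (size_[a] < size_[b]) std::swap(a, b);
        parent_[b] = a;  // attach the smaller tree under the larger one
        size_[a] += size_[b];
    }
private:
    std::vector<std::size_t> parent_, size_;
};

// Labels vertices 0..n-1 with component numbers 0..count-1 (numbered in
// order of the lowest vertex in each component) and returns count.
std::size_t label_components(
        std::size_t n,
        const std::vector<std::pair<std::size_t, std::size_t>>& edges,
        std::vector<std::size_t>& component) {
    DisjointSets sets(n);
    for (const auto& e : edges)
        sets.unite(e.first, e.second);

    const std::size_t unset = static_cast<std::size_t>(-1);
    std::vector<std::size_t> label_of_root(n, unset);
    component.assign(n, unset);
    std::size_t count = 0;
    for (std::size_t v = 0; v < n; ++v) {
        std::size_t root = sets.find(v);
        if (label_of_root[root] == unset)
            label_of_root[root] = count++;
        component[v] = label_of_root[root];
    }
    return count;
}
```

Because `unite` only needs one edge at a time, the edges could also be fed in directly while reading the file instead of being collected first. Memory still grows with `n`, so the vertex id domain matters here just as it does for the adjacency list.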