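I wrote some code to find the connected components of a very large graph (80 million edges), but it doesn't work: it crashes once the number of edges approaches 40 million.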
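#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    using namespace boost;
    int node1, node2;
    typedef adjacency_list<vecS, vecS, undirectedS> Graph;
    Graph G;
    // read one edge ("node1 node2") per line
    std::ifstream infile("pairs.txt");
    std::string line;
    while (std::getline(infile, line))
    {
        std::istringstream iss(line);
        iss >> node1 >> node2;
        add_edge(node1, node2, G);
    }
    std::cout << "writing file" << std::endl;
    std::ofstream out;
    out.open("connected_component.txt");
    // component[v] receives the component index of vertex v
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);
    std::vector<int>::size_type i;
    for (i = 0; i != component.size(); ++i) {
        out << i << "\t " << component[i] << std::endl;
    }
    out.close();
}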
How can I do this with Boost? Or should I change my graph data type?
Answer 0 (score: 1)
According to Massif, I can process 40 million edges in about 37s (peaking at 4.4GiB of memory), using random graph data:
/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test
Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms
real 0m37.339s
user 0m36.078s
sys 0m1.202s
Note, however, that I tweaked it to use a vector for the edge list, so that I could reserve the required capacity up front:
typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property,
no_property, vecS> Graph;
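To illustrate why reserving helps, here is a minimal sketch that uses a plain std::vector as a stand-in for the graph's edge container. The helper name count_reallocations and the element type are assumptions made for this illustration; they are not part of Boost. It counts how often the buffer's capacity changes while appending, with and without an up-front reserve.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a Boost function): appends n edges to a vector and
// counts how many times its capacity changed, i.e. how often it reallocated.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<std::pair<int, int>> edges;
    if (reserve_first)
        edges.reserve(n);  // a single allocation, done up front
    assert(!reserve_first || edges.capacity() >= n);

    std::size_t reallocations = 0;
    std::size_t last_capacity = edges.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        edges.emplace_back(static_cast<int>(i), static_cast<int>(i + 1));
        if (edges.capacity() != last_capacity) {
            ++reallocations;  // the buffer grew and its contents were moved
            last_capacity = edges.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set, the count is zero. Without it, the vector grows repeatedly, and each growth step briefly holds both the old and the new buffer, which matters when the edge list is already several gigabytes.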
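An important caveat is that storage requirements scale with the number of vertices. More specifically, they scale with the domain of the vertex IDs. For example, loading a file like this: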
1 7
2 7
5 6
4 9
will require much less memory than
1 70000
2 70000
5 60000
4 90000
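In fact, re-running the benchmark above with exactly the same input, except that the first line was changed from
47662 60203
to
476624766 602036020
gives the following result: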
Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @ 0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @ 0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @ 0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms
real 2m20.318s
user 1m30.224s
sys 0m49.821s
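As you can see, Google's malloc implementation (from gperftools) even warns about the exceptionally large allocations, and the run is indeed much slower. (Oh, and memory usage grew to the point where Massif could no longer cope, but I saw it hit 23GiB in htop.)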
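The allocation sizes in the tcmalloc warnings can be sanity-checked with some arithmetic. The sketch below rests on several assumptions: a vecS vertex container holds max_id + 1 slots, each slot costs 24 bytes in this particular build (a figure chosen because it matches the log, not a documented Boost guarantee), each of the two smaller allocations holds one 4-byte value per vertex, and the allocator rounds large requests up to a multiple of 8192 bytes. The helper names are made up for illustration.

```cpp
#include <cstdint>

// Round a request up to the next multiple of `page` bytes
// (assumed allocator granularity).
constexpr std::uint64_t round_up(std::uint64_t bytes, std::uint64_t page) {
    return (bytes + page - 1) / page * page;
}

// Estimated size of a per-vertex array when vertex IDs index directly into it,
// so that the array needs max_id + 1 slots regardless of how many IDs are used.
constexpr std::uint64_t per_vertex_alloc(std::uint64_t max_id,
                                         std::uint64_t bytes_per_slot) {
    return round_up((max_id + 1) * bytes_per_slot, 8192);
}

// Largest ID in the modified input is 602036020.
static_assert(per_vertex_alloc(602036020ULL, 24) == 14448869376ULL,
              "matches the first large alloc in the log");
static_assert(per_vertex_alloc(602036020ULL, 4) == 2408144896ULL,
              "matches the two smaller large allocs in the log");
```

Under the same assumptions, the original input (16-bit IDs, so a largest ID of at most 65535) needs at most per_vertex_alloc(65535, 24) = 1572864 bytes for the vertex slots, roughly 1.5MiB instead of about 13.5GiB. A single edge with large IDs is enough to cause the difference.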
See the benchmark program Live On Coliru, running on 4000 edges:
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <chrono>
#include <fstream>
#include <iostream>
#include <utility>
#include <vector>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    // listS out-edge lists, vecS vertices, and a vecS edge list (so it can be reserved)
    typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);
    std::vector<std::pair<int, int> > as_read;
    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);
    std::cout << "Reading " << as_read.size() << " done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // build graph, reserving the edge list up front
    G.m_edges.reserve(as_read.size());
    for (auto& pair : as_read)
        add_edge(pair.first, pair.second, G);
    std::cout << "Building graph done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components; num is the number of components found
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);
    std::cout << "Algorithm done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output: one "vertex<TAB> component" line per vertex
    std::cout << "writing file" << std::endl;
    std::ofstream out;
    out.open("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i) {
        out << i << "\t " << component[i] << std::endl;
    }
    out.close();
    std::cout << "Writing done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}
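Because storage scales with the domain of the vertex IDs rather than with the number of distinct vertices, one possible workaround (not part of the program above) is to remap the IDs to dense indices before building the graph. The following is a minimal sketch under that assumption; the helper name compact_ids and the use of long long IDs are hypothetical choices for this illustration.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not a Boost function): rewrites the edge list in place
// so that vertex IDs become dense indices 0..k-1, assigned in order of first
// appearance, and returns the original ID for each dense index.
std::vector<long long> compact_ids(
        std::vector<std::pair<long long, long long>>& edges) {
    std::unordered_map<long long, long long> dense;  // original ID -> dense index
    std::vector<long long> original;                  // dense index -> original ID

    auto remap = [&](long long id) -> long long {
        auto it = dense.find(id);
        if (it != dense.end())
            return it->second;  // ID seen before: reuse its index
        long long index = static_cast<long long>(original.size());
        dense.emplace(id, index);
        original.push_back(id);
        return index;
    };

    for (auto& e : edges) {
        e.first = remap(e.first);
        e.second = remap(e.second);
    }
    return original;
}
```

For the second example file (1 70000, 2 70000, 5 60000, 4 90000) this yields 7 distinct vertices, so a graph built from the remapped edges needs 7 vertex slots instead of 90001. When writing the results, original[i] translates a dense index back to the ID from the input file. The trade-off is the extra memory and time spent on the hash map, which grows with the number of distinct IDs.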