Boost::Interprocess shared memory bus error

Asked: 2010-06-23 15:51:12

Tags: c++ linux boost shared-memory boost-interprocess

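I am using CentOS 5.4 x86_64 and Boost 1.42.0 on a cluster that uses Open-MPI 1.3.3. I'm writing a shared library that uses shared memory to store large amounts of data for use by multiple processes. There is also a loader application that reads the data from files and loads it into shared memory.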

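When I run the loader application, it determines the exact amount of memory needed to store the data and then adds 25% for overhead. For just about every file, that comes to over 2 gigs of data. When I make the memory request through Boost's Interprocess library, it reports that it has successfully reserved the requested amount of memory. But when I start using it, I get a "Bus error". From what I can tell, the bus error is the result of accessing memory outside the range that is available for the memory segment.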

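So I started looking into how shared memory works on Linux and what to check to make sure my system is configured correctly to allow that large an amount of shared memory.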

  1. I looked at the "files" under /proc/sys/kernel/shm*:
    • shmall - 4294967296 (counted in pages, not bytes)
    • shmmax - 68719476736 bytes (64 GiB)
    • shmmni - 4096

  2. I ran the ipcs -lm command:
    ------ Shared Memory Limits --------
    max number of segments = 4096
    max seg size (kbytes) = 67108864
    max total shared memory (kbytes) = 17179869184
    min seg size (bytes) = 1
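  3. As far as I can tell, those settings indicate that I should be able to allocate enough shared memory for my purposes. So I created a stripped-down program that creates a large amount of data in shared memory: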

    
    #include <iostream>
    
    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>
    #include <boost/interprocess/containers/vector.hpp>
    
    namespace bip = boost::interprocess;
    
    typedef bip::managed_shared_memory::segment_manager segment_manager_t;
    typedef bip::allocator<long, segment_manager_t> long_allocator;
    typedef bip::vector<long, long_allocator> long_vector;
    
    int main(int argc, char ** argv) {
        struct shm_remove  { 
            shm_remove()    { bip::shared_memory_object::remove("ShmTest"); } 
            ~shm_remove()   { bip::shared_memory_object::remove("ShmTest"); } 
        } remover; 
    
        size_t szLength = 280000000;
        size_t szRequired = szLength * sizeof(long);
        size_t szRequested = (size_t) (szRequired * 1.05);
        bip::managed_shared_memory segment(bip::create_only, "ShmTest", szRequested); 
    
        std::cout << 
            "Length:       " << szLength << "\n" <<
            "sizeof(long): " << sizeof(long) << "\n" <<
            "Required:     " << szRequired << "\n" <<
            "Requested:    " << szRequested << "\n" <<
            "Allocated:    " << segment.get_size() << "\n" <<
            "Overhead:     " << segment.get_size() - segment.get_free_memory() << "\n" <<
            "Free:         " << segment.get_free_memory() << "\n\n";
    
        long_allocator alloc(segment.get_segment_manager()); 
        long_vector vector(alloc);
    
        if (argc > 1) {
            std::cout << "Reserving Length of " << szLength << "\n";
            vector.reserve(szLength);
            std::cout << "Vector Capacity: " << vector.capacity() << "\tFree: " << segment.get_free_memory() << "\n\n";
        }
    
        for (size_t i = 0; i < szLength; i++) {
            if ((i % (szLength / 100)) == 0) {
                std::cout << i << ": " << "\tVector Capacity: " << vector.capacity() << "\tFree: " << segment.get_free_memory() << "\n";
            }
            vector.push_back(i);    
        }
        std::cout << "end: " << "\tVector Capacity: " << vector.capacity() << "\tFree: " << segment.get_free_memory() << "\n";
    
        return 0;
    }
    

    Compiled with the following line:

    g++ ShmTest.cpp -lboost_system -lrt

    Then I ran it, with the following output (edited to make it smaller):

    Length:       280000000
    sizeof(long): 8
    Required:     2240000000
    Requested:    2352000000
    Allocated:    2352000000
    Overhead:     224
    Free:         2351999776
    
    0:      Vector Capacity: 0      Free: 2351999776
    2800000:        Vector Capacity: 3343205        Free: 2325254128
    5600000:        Vector Capacity: 8558607        Free: 2283530912
    8400000:        Vector Capacity: 8558607        Free: 2283530912
    11200000:       Vector Capacity: 13693771       Free: 2242449600
    14000000:       Vector Capacity: 21910035       Free: 2176719488
    ...
    19600000:       Vector Capacity: 21910035       Free: 2176719488
    22400000:       Vector Capacity: 35056057       Free: 2071551312
    ...
    33600000:       Vector Capacity: 35056057       Free: 2071551312
    36400000:       Vector Capacity: 56089691       Free: 1903282240
    ...
    56000000:       Vector Capacity: 56089691       Free: 1903282240
    58800000:       Vector Capacity: 89743507       Free: 1634051712
    ...
    89600000:       Vector Capacity: 89743507       Free: 1634051712
    92400000:       Vector Capacity: 143589611      Free: 1203282880
    ...
    142800000:      Vector Capacity: 143589611      Free: 1203282880
    145600000:      Vector Capacity: 215384417      Free: 628924432
    ...
    212800000:      Vector Capacity: 215384417      Free: 628924432
    215600000:      Vector Capacity: 293999969      Free: 16
    ...
    260400000:      Vector Capacity: 293999969      Free: 16
    Bus error
    

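    If you run the program with a parameter (anything will do, it just needs to increase argc), it preallocates the vector but still results in a bus error at the same array index.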

    I checked the size of the "file" in /dev/shm using the ls -ash /dev/shm command:

    total 2.0G
       0 .     0 ..  2.0G ShmTest
    

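    Just as in my original application, the size of the allocated shared memory is capped at 2 gigs. Given that it "successfully" allocated 2352000000 bytes of memory, the size in gigabytes (using 1024 * 1024 * 1024) should be 2.19 Gb.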

    When I run my actual program to load the data with MPI, I get this error output:

    Requested: 2808771120
    Recieved: 2808771120
    
    [c1-master:13894] *** Process received signal ***
    [c1-master:13894] Signal: Bus error (7)
    [c1-master:13894] Signal code:  (2)
    [c1-master:13894] Failing at address: 0x2b3190157000
    [c1-master:13894] [ 0] /lib64/libpthread.so.0 [0x3a64e0e7c0]
    [c1-master:13894] [ 1] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost12interprocess26uninitialized_copy_or_moveINS0_10offset_ptrIlEEPlEET0_T_S6_S5_PNS_10disable_ifINS0_11move_detail16is_move_iteratorIS6_EEvE4typeE+0x218) [0x2b310dcf3fb8]
    [c1-master:13894] [ 2] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost9container6vectorIlNS_12interprocess9allocatorIlNS2_15segment_managerIcNS2_15rbtree_best_fitINS2_12mutex_familyENS2_10offset_ptrIvEELm0EEENS2_10iset_indexEEEEEE15priv_assign_auxINS7_IlEEEEvT_SG_St20forward_iterator_tag+0xa75) [0x2b310dd0a335]
    [c1-master:13894] [ 3] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost9container17containers_detail25advanced_insert_aux_proxyINS0_6vectorIlNS_12interprocess9allocatorIlNS4_15segment_managerIcNS4_15rbtree_best_fitINS4_12mutex_familyENS4_10offset_ptrIvEELm0EEENS4_10iset_indexEEEEEEENS0_17constant_iteratorISF_lEEPSF_E25uninitialized_copy_all_toESI_+0x1d7) [0x2b310dd0b817]
    [c1-master:13894] [ 4] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost9container6vectorINS1_IlNS_12interprocess9allocatorIlNS2_15segment_managerIcNS2_15rbtree_best_fitINS2_12mutex_familyENS2_10offset_ptrIvEELm0EEENS2_10iset_indexEEEEEEENS3_ISD_SB_EEE17priv_range_insertENS7_ISD_EEmRNS0_17containers_detail23advanced_insert_aux_intISD_PSD_EE+0x771) [0x2b310dd0d521]
    [c1-master:13894] [ 5] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost12interprocess6detail8Ctor3ArgINS_9container6vectorINS4_IlNS0_9allocatorIlNS0_15segment_managerIcNS0_15rbtree_best_fitINS0_12mutex_familyENS0_10offset_ptrIvEELm0EEENS0_10iset_indexEEEEEEENS5_ISF_SD_EEEELb0EiSF_NS5_IvSD_EEE11construct_nEPvmRm+0x157) [0x2b310dd0d9a7]
    [c1-master:13894] [ 6] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost12interprocess15segment_managerIcNS0_15rbtree_best_fitINS0_12mutex_familyENS0_10offset_ptrIvEELm0EEENS0_10iset_indexEE28priv_generic_named_constructIcEEPvmPKT_mbbRNS0_6detail18in_place_interfaceERNS7_INSE_12index_configISB_S6_EEEENSE_5bool_ILb1EEE+0x6fd) [0x2b310dd0c85d]
    [c1-master:13894] [ 7] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN5boost12interprocess15segment_managerIcNS0_15rbtree_best_fitINS0_12mutex_familyENS0_10offset_ptrIvEELm0EEENS0_10iset_indexEE22priv_generic_constructEPKcmbbRNS0_6detail18in_place_interfaceE+0xf8) [0x2b310dd0dd58]
    [c1-master:13894] [ 8] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN7POP_LTL16ExportPopulation22InitializeSharedMemoryEPKc+0x1609) [0x2b310dceea99]
    [c1-master:13894] [ 9] ../LookupPopulationLib/Release/libLookupPopulation.so(_ZN7POP_LTL10InitializeEPKc+0x349) [0x2b310dd0ebb9]
    [c1-master:13894] [10] MPI_Release/LookupPopulation.MpiLoader(main+0x372) [0x4205d2]
    [c1-master:13894] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a6461d994]
    [c1-master:13894] [12] MPI_Release/LookupPopulation.MpiLoader(__gxx_personality_v0+0x239) [0x420009]
    [c1-master:13894] *** End of error message ***
    --------------------------------------------------------------------------
    mpirun noticed that process rank 0 with PID 13894 on node c1-master exited on signal 7 (Bus error).
    --------------------------------------------------------------------------
    

    I really don't know how to proceed with this. Does anyone have any suggestions for what to try?


    Posted to the Boost bug tracker: https://svn.boost.org/trac/boost/ticket/4374

1 Answer:

Answer 0 (score: 9):

Well, if you keep looking for an answer long enough...

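On Linux, the shared memory mechanism it uses (tmpfs) is limited by default to half of the system RAM. On my cluster that is 2 Gb, because we have 4 Gb of system RAM. So when it tried to allocate the shared memory segment, it allocated only up to the maximum size left on /dev/shm.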
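
The limit described above can be checked from code before the segment is created. The following is a minimal sketch, assuming the segment is backed by the filesystem mounted at /dev/shm (as the ls output in the question suggests); `available_bytes` and `segment_fits` are hypothetical helper names, not part of Boost.Interprocess, and the code assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/statvfs.h>

// Hypothetical helper: bytes still available to unprivileged users on the
// filesystem that contains `path`, or -1 if statvfs() fails.
std::int64_t available_bytes(const char* path) {
    assert(path != nullptr);
    struct statvfs info;
    if (statvfs(path, &info) != 0) {
        return -1;
    }
    // f_bavail is a count of free blocks; f_frsize is the block size in bytes.
    return static_cast<std::int64_t>(info.f_bavail) *
           static_cast<std::int64_t>(info.f_frsize);
}

// Hypothetical helper: true when `requested` bytes fit into the space that
// is currently left on the filesystem containing `path`.
bool segment_fits(const char* path, std::uint64_t requested) {
    const std::int64_t avail = available_bytes(path);
    return avail >= 0 && static_cast<std::uint64_t>(avail) >= requested;
}
```

A caller could evaluate `segment_fits("/dev/shm", szRequested)` before constructing the `bip::managed_shared_memory` and refuse to continue when it returns false. The check is only advisory, since another process can consume space between the check and the allocation.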

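The problem is that the Boost library neither signalled an error nor reported the correct amount of available memory when it could not allocate the requested amount. It just happily chugged along, apparently, until it reached the end of the segment and then errored out.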
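
A likely reason for the silent success is that tmpfs hands out pages only when they are first touched, so setting the size of the backing file works even when the space is not there. One way to turn the late bus error into an early error code is to reserve the storage explicitly. The sketch below does this with `posix_fallocate()` on a plain file path; it is not how Boost.Interprocess works internally, `reserve_backing_store` is a hypothetical name, and whether the call reserves tmpfs pages natively or through the C library's block-by-block fallback depends on the kernel and glibc versions.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: creates (or truncates) the file at `path` and reserves
// `size` bytes of real backing storage for it. Returns 0 on success or an
// errno value such as ENOSPC when the filesystem cannot hold that much data.
int reserve_backing_store(const std::string& path, off_t size) {
    assert(size > 0);
    const int fd = open(path.c_str(), O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0) {
        return errno;
    }
    // Unlike ftruncate(), which only sets the length, posix_fallocate()
    // reserves the blocks. It returns the error number directly and does
    // not set errno.
    const int rc = posix_fallocate(fd, 0, size);
    close(fd);
    if (rc != 0) {
        unlink(path.c_str());  // do not leave a partially reserved file behind
    }
    return rc;
}
```

Called on a scratch path under /dev/shm with the intended segment size, a non-zero return value (typically ENOSPC) would indicate that the tmpfs mount is too small, before any data is written.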

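The long-term solution is to update the /etc/fstab file to make the change permanent, but a command line call can be run to increase the size of the available shared memory on each node until it reboots.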

mount -o remount,size=XXX /dev/shm

XXX is the amount of memory to make available (for example size=4G).

This was figured out from http://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html