I get the following error messages from valgrind:
==1808== 0 bytes in 1 blocks are still reachable in loss record 1 of 1,734
==1808== at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==1808== by 0x4CC2BA9: hwloc_build_level_from_list (topology.c:1603)
==1808== by 0x4CC2BA9: hwloc_connect_levels (topology.c:1774)
==1808== by 0x4CC2F25: hwloc_discover (topology.c:2091)
==1808== by 0x4CC2F25: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==1808== by 0x4C60957: orte_odls_base_open (odls_base_open.c:205)
==1808== by 0x632FDB3: ???
==1808== by 0x4C3B6B9: orte_init (orte_init.c:127)
==1808== by 0x403E0E: orterun (orterun.c:693)
==1808== by 0x4035E3: main (main.c:13)
==1808==
==1808== 0 bytes in 1 blocks are still reachable in loss record 2 of 1,734
==1808== at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==1808== by 0x4CC2BD5: hwloc_build_level_from_list (topology.c:1603)
==1808== by 0x4CC2BD5: hwloc_connect_levels (topology.c:1775)
==1808== by 0x4CC2F25: hwloc_discover (topology.c:2091)
==1808== by 0x4CC2F25: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==1808== by 0x4C60957: orte_odls_base_open (odls_base_open.c:205)
==1808== by 0x632FDB3: ???
==1808== by 0x4C3B6B9: orte_init (orte_init.c:127)
==1808== by 0x403E0E: orterun (orterun.c:693)
==1808== by 0x4035E3: main (main.c:13)
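I cannot work out what kind of problem valgrind is reporting. Could someone explain?
I have checked every object created with new. All of them are deleted correctly.
When the program finishes, I get the valgrind error messages and a further error from MPI: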
---------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1811 on node laki.pi.ingv.it exited on signal 11 (Segmentation fault).
----------------------------------------------------------------------
This is the error message related to MPI_Init:
==31198== 0 bytes in 1 blocks are still reachable in loss record 1 of 368
==31198== at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==31198== by 0xC66DE49: hwloc_build_level_from_list (topology.c:1603)
==31198== by 0xC66DE49: hwloc_connect_levels (topology.c:1774)
==31198== by 0xC66E1C5: hwloc_discover (topology.c:2091)
==31198== by 0xC66E1C5: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==31198== by 0xC62B473: opal_hwloc_unpack (hwloc_base_dt.c:83)
==31198== by 0xC6270AB: opal_dss_unpack_buffer (dss_unpack.c:120)
==31198== by 0xC62815F: opal_dss_unpack (dss_unpack.c:84)
==31198== by 0xC5F2349: orte_util_nidmap_init (nidmap.c:146)
==31198== by 0xED98608: ???
==31198== by 0xC5DC0B9: orte_init (orte_init.c:127)
==31198== by 0xC59DBAE: ompi_mpi_init (ompi_mpi_init.c:357)
==31198== by 0xC5B443F: PMPI_Init (pinit.c:84)
==31198== by 0x55FA53: main (solver_2d.hpp:22)
where line solver_2d.hpp:22 is exactly:
MPI_Init(&argc, &argv);
In addition, the error messages related to MPI_Finalize() are:
==31198== 1 errors in context 1 of 58:
==31198== Syscall param write(buf) points to uninitialised byte(s)
==31198== at 0x38EF00E6FD: ??? (in /lib64/libpthread-2.12.so)
==31198== by 0x11F1F548: ???
==31198== by 0x11F1E03F: ???
==31198== by 0x11CD7FBA: ???
==31198== by 0x11CE519A: ???
==31198== by 0x11CE3C37: ???
==31198== by 0x11CD90C1: ???
==31198== by 0x11AC2E36: ???
==31198== by 0xC59ECC4: ompi_mpi_finalize (ompi_mpi_finalize.c:285)
==31198== by 0x562185: main (solver_2d.hpp:171)
==31198== Address 0x1ffeffda24 is on thread 1's stack
==31198== Uninitialised value was created by a stack allocation
==31198== at 0x11CCE050: ???
and
==31197== Syscall param write(buf) points to uninitialised byte(s)
==31197== at 0x38EF00E6FD: ??? (in /lib64/libpthread-2.12.so)
==31197== by 0x11F1F548: ipath_cmd_write (in /usr/lib64/libinfinipath.so.4.0)
==31197== by 0x11F1E03F: ipath_poll_type (in /usr/lib64/libinfinipath.so.4.0)
==31197== by 0x11CD7FBA: psmi_context_interrupt_set (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197== by 0x11CE519A: ips_ptl_rcvthread_fini (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197== by 0x11CE3C37: ??? (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197== by 0x11CD90C1: psm_ep_close (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197== by 0x11AC2E36: ompi_mtl_psm_finalize (mtl_psm.c:200)
==31197== by 0xC59ECC4: ompi_mpi_finalize (ompi_mpi_finalize.c:285)
==31197== by 0x562185: main (solver_2d.hpp:171)
==31197== Address 0x1ffeffda24 is on thread 1's stack
==31197== in frame #2, created by ipath_poll_type (???:)
==31197== Uninitialised value was created by a stack allocation
==31197== at 0x11CCE050: ??? (in /usr/lib64/libpsm_infinipath.so.1.15)
where line solver_2d.hpp:171 corresponds to:
MPI_Finalize();
Finally, the error message related to MPI_write, or more precisely to MPI_File_open, is:
==31198== 48 bytes in 1 blocks are still reachable in loss record 104 of 368
==31198== at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==31198== by 0xC58C750: opal_obj_new (opal_object.h:469)
==31198== by 0xC58C750: ompi_attr_set_c (attribute.c:761)
==31198== by 0xC5AA0BE: PMPI_Attr_put (pattr_put.c:58)
==31198== by 0x118501AB: ???
==31198== by 0x11843159: ???
==31198== by 0x1185657D: ???
==31198== by 0xC5CEFB5: module_init (io_base_file_select.c:442)
==31198== by 0xC5CEFB5: mca_io_base_file_select (io_base_file_select.c:214)
==31198== by 0xC5977A5: ompi_file_open (file.c:128)
==31198== by 0xC5C6557: PMPI_File_open (pfile_open.c:96)
==31198== by 0x5638A1: p_fstream (p_fstream.hpp:86)
where line p_fstream.hpp:86 is:
MPI_File_open(MPI_COMM_WORLD, const_cast<char*>(fname.c_str()), flags, MPI_INFO_NULL, &mpi_file);
Answer 0 (score: 1)
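The valgrind messages report memory leaks in mpirun itself, which you probably do not care about.
I guess you ran
valgrind mpirun a.out
but what you really want is to look for invalid memory accesses and leaks in the MPI application itself. In that case, you should run
mpirun valgrind a.out
Note that the output of all tasks will be interleaved. Since you are using Open MPI, you can run
mpirun --tag-output valgrind a.out
to prefix each task's output with its rank.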
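If interleaved output is still hard to read, another option is to have each rank write its own valgrind log. The sketch below is only an illustration: the rank count (2), the program name ./a.out, and the log file name are assumed values. It relies on valgrind expanding %q{VAR} in --log-file from the environment and on Open MPI setting OMPI_COMM_WORLD_RANK for each rank. It only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Assumed values for illustration: 2 ranks and a program called ./a.out.
NP=2
APP=./a.out

# valgrind replaces %q{VAR} with the value of the environment variable VAR,
# and Open MPI exports OMPI_COMM_WORLD_RANK to every rank it starts,
# so each rank ends up with its own log file.
LOG='valgrind.rank%q{OMPI_COMM_WORLD_RANK}.log'

CMD="mpirun -np $NP valgrind --leak-check=full --log-file=$LOG $APP"

# Print the command for inspection; to actually run it use: eval "$CMD"
echo "$CMD"
```

With the logs separated per rank, reports that come from Open MPI's own initialisation and finalisation code are easier to tell apart from reports that point into your own source files.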