Question

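I have two point cloud files in .txt format (Scene and Green). The Scene point cloud typically contains more than 100,000 lines, and Green contains about 20,000 lines. The green points appear as identical lines in both files, except for the last number, which is the label of each point.

Scene: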

0.805309, -3.43696, 6.85463, 0, 0, 0, 5
0.811636, -3.42248, 6.82576, 0, 0, 0, 5
-1.00663, 0.0985967, 3.02769, 42, 134, 83, 5
-1.00182, 0.098547, 3.02617, 43, 133, 83, 5
-0.997052, 0.0985018, 3.02478, 41, 133, 82, 5
0.811636, -3.42248, 6.82576, 0, 0, 0, 5

Green:

-1.00663, 0.0985967, 3.02769, 42, 134, 83, 3
-1.00182, 0.098547, 3.02617, 43, 133, 83, 3
-0.997052, 0.0985018, 3.02478, 41, 133, 82, 3

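I want to replace the whole line of each green point in Scene with the matching line from the Green file, or simply change the label from 5 to 3 wherever the two lines are otherwise equal. The final result would look like this:

Scene: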

0.805309, -3.43696, 6.85463, 0, 0, 0, 5
0.811636, -3.42248, 6.82576, 0, 0, 0, 5
-1.00663, 0.0985967, 3.02769, 42, 134, 83, 3
-1.00182, 0.098547, 3.02617, 43, 133, 83, 3
-0.997052, 0.0985018, 3.02478, 41, 133, 82, 3
0.811636, -3.42248, 6.82576, 0, 0, 0, 5

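I have written two versions of the code to do this, but because there are many files to modify, both take a very long time to run, which is not acceptable. First code: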

import os
import fileinput
def main(scene, others):

    for file in others:
        other = open(file, "r+")
        for line in other:
            line1 = line[:-3]
            f=scene
            for sceneLine in fileinput.input(f,inplace=True):
                new = sceneLine
                sceneLine1 = sceneLine[:-3]
                if sceneLine1 == line1:
                    print(sceneLine.replace(new, line), end='')
                else:
                    print(sceneLine.replace(line,line), end='')
            fileinput.close()


others = []
for file in os.listdir("./"):
    if file.endswith(".txt"):
        if file.startswith("pointCloudScene9863Cl"):
            scene = file
        else:
            others.append(file)

main(scene,others)

Second code:

import os
import fileinput
import numpy

def main(scene1, others):

    pointcloud = []
    scene1 = open(scene1,"r+")
    scene = []
    for each_point in scene1:
        scene.append(each_point)

    for file in others:
        other = open(file, "r+")
        for line in other:
            pointcloud = []
            line1 = line[:-3]
            for sceneLine in scene:
                sceneLine1 = sceneLine[:-3]
                if sceneLine1 == line1:
                    pointcloud.append(line)
                else:
                    pointcloud.append(sceneLine)
            scene = pointcloud

    with open('pointcloud.txt', 'w') as points:
        for item in scene:
            points.write("%s" % item)


others = []
for file in os.listdir("./"):
    if file.endswith(".txt"):
        if file.startswith("pointCloudScene9863Cl"):
            scene = file
        else:
            others.append(file)

main(scene,others)

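Both approaches work perfectly with a small number of points, but with the original point cloud files they take 30 minutes or even longer to finish. I can see that the problem is in the for loops: since I am basically using nested loops, I end up with 100000 * 20000 iterations to change the green points.

Is there an efficient way to do this using numpy arrays or any other approach?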

Answer 1

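I have a solution that should be adequate, but first a disclaimer: without more information from you, it is impossible to find a truly proper solution. We need the context of this problem, and more precise and detailed information about the data format and what you are trying to do.

For example, comparing floating-point numbers for equality does not feel right, and operating on the numbers always carries some risk in terms of precision. Since the points seem to come from the same source, it would help if each one had some kind of unique ID that could be used to check for equality.

Like others here, my first reaction was to reach for numpy and pandas. In my opinion that is a mistake, because this task does not involve much data manipulation or transformation at all.

So, here is the simplest implementation I can think of right now: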
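
One way to sidestep the floating-point concern is to never parse the numbers and compare the text of the fields instead. The following is only a minimal sketch, assuming both files were written by the same exporter so that the same point has identical text in each file; `point_key` is a hypothetical helper, not something from the question.

```python
def point_key(line):
    """Return the text of a line without its last field (the label)."""
    # rpartition splits at the LAST comma:
    # "x, y, z, r, g, b, 5\n" -> "x, y, z, r, g, b"
    fields_without_label, _sep, _label = line.rstrip("\n").rpartition(",")
    return fields_without_label.strip()

scene_line = "-1.00663, 0.0985967, 3.02769, 42, 134, 83, 5\n"
green_line = "-1.00663, 0.0985967, 3.02769, 42, 134, 83, 3\n"

# Same point, different label: the keys compare equal as plain strings.
same_point = point_key(scene_line) == point_key(green_line)
print(same_point)
```

Unlike the `line[:-3]` slicing in the question, this also handles labels with more than one digit and a final line without a trailing newline. It will not match points whose text differs only in formatting (for example `0.50` versus `0.5`).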

def point_parse(line):
    line_point = line.split(", ")
    line_point[0] = float(line_point[0])
    line_point[1] = float(line_point[1])
    line_point[2] = float(line_point[2])
    line_point[3] = int(line_point[3])
    line_point[4] = int(line_point[4])
    line_point[5] = int(line_point[5])
    line_point[6] = int(line_point[6])
    return tuple(line_point)

green_points_set: frozenset
black_points_set: frozenset

with open("../resources/Green_long.txt", "r") as green_file:
    green_points_set = frozenset((point_parse(line)[:-1] for line in green_file))

with open("../resources/Black_long.txt", "r") as black_file:
    black_points_set = frozenset((point_parse(line)[:-1] for line in black_file))

def set_point_label(point):
    point_comp = point[:-1]
    if point_comp in green_points_set:
        point_comp += (3,)
    elif point_comp in black_points_set:
        point_comp += (4,)
    else:
        point_comp = point
    return point_comp

with open("../resources/Scene_long.txt", "r") as scene_file:
    scene_points_new = (set_point_label(point_parse(line)) for line in scene_file)
    form_lines = ((f"{res_line[0]}, {res_line[1]}, {res_line[2]}, {res_line[3]}, "
               f"{res_line[4]}, {res_line[5]}, {res_line[6]}\n") for res_line in scene_points_new)

    with open("../out/Scene_out.txt", "w") as scene_out:
        scene_out.writelines(form_lines)

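The code is quite simple. Sets are created for the green and black points, we test each scene point for membership, and change the label accordingly.

I created some test data for myself: a scene with 1,000,000 points in total, 125,000 green points and 125,000 black points. The runtime was under 7 seconds (hopefully I did not make any serious mistakes!), and memory usage should be low.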
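
The same hash-lookup idea can also be sketched with a dictionary that maps each point to its replacement line, so any number of label files can be merged into one lookup. This is a variation under assumptions, not the code above: it keys on the line text rather than on parsed numbers, and `key_of` and `relabel` are hypothetical names. The data is the sample from the question.

```python
def key_of(line):
    # Everything before the last comma: coordinates and color, without the label.
    return line.rstrip("\n").rpartition(",")[0]

def relabel(scene_lines, labelled_lines):
    """Yield scene lines, swapping in the labelled line when the point matches."""
    # Build the lookup once, in a single pass over the labelled lines.
    replacement = {key_of(line): line for line in labelled_lines}
    for line in scene_lines:
        # dict.get is a hash lookup; fall back to the original scene line.
        yield replacement.get(key_of(line), line)

scene = [
    "0.805309, -3.43696, 6.85463, 0, 0, 0, 5\n",
    "0.811636, -3.42248, 6.82576, 0, 0, 0, 5\n",
    "-1.00663, 0.0985967, 3.02769, 42, 134, 83, 5\n",
    "-1.00182, 0.098547, 3.02617, 43, 133, 83, 5\n",
    "-0.997052, 0.0985018, 3.02478, 41, 133, 82, 5\n",
    "0.811636, -3.42248, 6.82576, 0, 0, 0, 5\n",
]
green = [
    "-1.00663, 0.0985967, 3.02769, 42, 134, 83, 3\n",
    "-1.00182, 0.098547, 3.02617, 43, 133, 83, 3\n",
    "-0.997052, 0.0985018, 3.02478, 41, 133, 82, 3\n",
]

result = list(relabel(scene, green))
print("".join(result), end="")
```

With real files, `scene_lines` and `labelled_lines` can be open file objects, and the generator can be passed straight to `writelines` on the output file. The total work is one pass over each file instead of one scene pass per labelled line.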

Answer 2

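I think you should ask yourself some basic questions about the data:

Is the order of the points preserved across the files? I mean, do you always need to search the whole file, or can you skip comparing part of the file once a green point has been found at some position?
100,000 records is not that many. Will there ever be 1000 times more? Can you read the whole file into memory at once (a NumPy array or a DataFrame), so that you work from RAM and CPU cache rather than reading from disk many times? Setting an offset at the most recently found green point would be a viable option.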
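
If the order really is preserved, the offset idea could look roughly like the sketch below. It is only an illustration under a strong assumption: the green lines appear in the scene in the same relative order as in the green file. `relabel_ordered` is a hypothetical name and the sample lines are made up for brevity.

```python
def relabel_ordered(scene_lines, green_lines):
    """Single pass over the scene, assuming green points keep scene order."""
    def key_of(line):
        # Line text without the last field (the label).
        return line.rstrip("\n").rpartition(",")[0]

    out = []
    offset = 0  # index of the next green line that has not been matched yet
    for line in scene_lines:
        if offset < len(green_lines) and key_of(line) == key_of(green_lines[offset]):
            out.append(green_lines[offset])
            offset += 1  # earlier green lines are never compared again
        else:
            out.append(line)
    return out, offset

scene = [
    "0.8, -3.4, 6.8, 0, 0, 0, 5\n",
    "-1.0, 0.09, 3.0, 42, 134, 83, 5\n",
    "0.8, -3.4, 6.8, 0, 0, 0, 5\n",
]
green = ["-1.0, 0.09, 3.0, 42, 134, 83, 3\n"]

new_scene, matched = relabel_ordered(scene, green)
print(matched == len(green))  # False would mean the ordering assumption failed
```

Each scene line is compared against at most one green line, so the work is a single pass. If `matched` ends up smaller than the number of green lines, the ordering assumption did not hold and a hash-based lookup is the safer choice.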

Answer 3

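A "brute force" solution using numba JIT compilation. This is just for fun; it is better to use the frozenset approach. The most expensive operation seems to be the memory IO during mod_arr[j,:] = mod[i,:].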

import timeit
import numpy as np
from numba import njit

### numba njit-ed version of nested loops
@njit
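def modify(arr, mod, tol=0.000000001):
    mod_arr = arr.copy()  # copy, so the input array is not modified in place
    mask = np.ones(arr.shape[0]).astype(np.bool_)  # True = row not matched yet
    idx = np.arange(0, arr.shape[0], 1)
    for i in range(mod.shape[0]):
        for j in idx[mask]:
            # sum of absolute differences, so opposite-sign differences cannot cancel out
            if np.sum(np.absolute(arr[j,:-1]-mod[i,:-1])) < tol:
                mod_arr[j,:] = mod[i,:]
                mask[j] = False
    return mod_arr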

# "scene":
a = np.array([[0.805309, -3.43696, 6.85463, 0, 0, 0, 5],
              [0.811636, -3.42248, 6.82576, 0, 0, 0, 5],
              [-1.00663, 0.0985967, 3.02769, 42, 134, 83, 5],
              [-1.00182, 0.098547, 3.02617, 43, 133, 83, 5],
              [-0.997052, 0.0985018, 3.02478, 41, 133, 82, 5],
              [0.811636, -3.42248, 6.82576, 0, 0, 0, 5]])
# "green":
m = np.array([[-1.00663, 0.0985967, 3.02769, 42, 134, 83, 3],
              [-1.00182, 0.098547, 3.02617, 43, 133, 83, 3],
              [-0.997052, 0.0985018, 3.02478, 41, 133, 82, 3]])
# desired output:
mod_arr_test = np.array([[0.805309, -3.43696, 6.85463, 0, 0, 0, 5],
                         [0.811636, -3.42248, 6.82576, 0, 0, 0, 5],
                         [-1.00663, 0.0985967, 3.02769, 42, 134, 83, 3],
                         [-1.00182, 0.098547, 3.02617, 43, 133, 83, 3],
                         [-0.997052, 0.0985018, 3.02478, 41, 133, 82, 3],
                         [0.811636, -3.42248, 6.82576, 0, 0, 0, 5]])
# check:
mod_arr = modify(a, m)
print([np.allclose(mod_arr[i], l) for i, l in enumerate(mod_arr_test)])
# --> [True, True, True, True, True, True]

# now let's make the arrays big...
a = np.tile(a, (17000, 1)) # a.shape is (102000, 7)
m = np.tile(m, (7000, 1)) # m.shape is (21000, 7)

### performance check (%timeit is an IPython/Jupyter magic):
%timeit modify(a, m)
# --> 2min 55s ± 4.07 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Replace the lines of a file with the lines of another file