Question

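I have a file A with 3M lines and an array arr holding 500k entries in two columns (col1, col2). I need to check which lines in file A match col1 in arr, use that information to build a string from the following lines, and append col2 to the end of that string.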

Here is an example I put together to show the logic of the code:

def myfun(file, arr):

    arr = quicksort(arr)  # custom quicksort implementation
                          # arr may contain duplicate values

    with open(file, 'r', encoding='utf8') as f:
        for line in f:
            if line.startswith('something'):

                lineparts = line.split()  # line is space-separated columns
                                          # we need the second column

                idx = binary_search(arr, lineparts[1])  # index of the matching entry in arr,
                                                        # or -1 if not found

                if idx != -1:            # if found, store the col2 value for later use
                    temp_var = arr[idx][1]
                    del arr[idx]         # delete the arr entry as it's not needed anymore
                else:
                    # do something
                    temp_var = '0'

                #
                #   do concatenation of strings in lines as needed
                #   after finishing preparing the needed string
                #   write it to a new file
                #

This code works, but it is slow. Is there a better way to do this? Assume that quicksort and binary_search are implemented as efficiently as possible.

Answer 1

It looks like I overlooked something, because the answer to my question turned out to be quite obvious.

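Deleting an entry from the middle of an array is expensive: every element after it has to be moved (or the array copied without the deleted element), so each deletion takes time proportional to the length of the array. Doing that for every matching line is what made the run take so long.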

Note: just to clarify, this applies to a plain array with random access by index.
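To make the cost concrete, here is a minimal sketch of what removing an element from a contiguous array involves. The helper delete_at is hypothetical and only for illustration; Python's built-in del on a list does the equivalent shifting internally, much faster but still in time proportional to the number of elements after the index.

```python
def delete_at(items, idx):
    """Remove items[idx] by shifting later elements left, as a contiguous array must.

    Hypothetical helper for illustration only.
    Returns the number of elements that had to be moved.
    """
    moved = 0
    for i in range(idx, len(items) - 1):
        items[i] = items[i + 1]  # shift each later element one slot to the left
        moved += 1
    items.pop()  # drop the now-duplicated last slot; removing the tail is cheap
    return moved


data = list(range(10))
print(delete_at(data, 2))  # 7 (the elements after index 2 all had to move)
print(data)                # [0, 1, 3, 4, 5, 6, 7, 8, 9]
```

With 500k entries and deletions landing at arbitrary positions, each deletion moves a large share of the array, which fits the slowdown described above.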

After removing the line del arr[idx], the run time dropped from 72 seconds to 6 seconds.
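For reference, the lookup without the deletion might look roughly like the sketch below. It uses the standard library's bisect module as a stand-in for the custom binary_search, and it assumes arr is a list of (col1, col2) tuples sorted by col1. The name find_col2 is made up for this example.

```python
from bisect import bisect_left


def find_col2(sorted_arr, key, default='0'):
    """Return col2 of the first entry whose col1 equals key, else default.

    Assumes sorted_arr is a list of (col1, col2) tuples sorted by col1.
    """
    # (key,) sorts before any (key, col2), so bisect_left lands on the first match.
    idx = bisect_left(sorted_arr, (key,))
    if idx < len(sorted_arr) and sorted_arr[idx][0] == key:
        return sorted_arr[idx][1]
    return default


arr = sorted([('b', '2'), ('a', '1'), ('b', '3')])
print(find_col2(arr, 'b'))  # 2
print(find_col2(arr, 'z'))  # 0
```

One thing to keep in mind: with nothing deleted, a key that appears several times in arr always returns the same (first) entry, which is a change in behavior from the original loop.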

However, if there is a better way, please leave an answer!
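One candidate that may be worth measuring is a dictionary keyed on col1, which avoids both the sort and the binary search. This is only a sketch and has not been timed against the data described here. The names build_index and take_col2 are hypothetical, and it assumes arr is an iterable of (col1, col2) pairs. Each key maps to a deque so that duplicate entries can still be consumed one at a time, as the original del did.

```python
from collections import defaultdict, deque


def build_index(arr):
    """Group col2 values by col1, keeping their original order."""
    index = defaultdict(deque)
    for col1, col2 in arr:
        index[col1].append(col2)
    return index


def take_col2(index, key, default='0'):
    """Pop and return the next unused col2 for key, or default if none is left."""
    queue = index.get(key)  # .get() does not create an entry for a missing key
    if queue:
        return queue.popleft()  # constant time, unlike deleting from the middle of a list
    return default


index = build_index([('b', '2'), ('a', '1'), ('b', '3')])
print([take_col2(index, k) for k in ('b', 'b', 'b', 'x')])  # ['2', '3', '0', '0']
```

Inside the file loop, take_col2(index, lineparts[1]) would take the place of the binary_search call and the del.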
