Question

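I am trying to find the positions of matches (N or -) in a large dataset. Each string (3 million letters) contains about 300,000 matches. I have 110 strings to search in the same file, so I loop with re.finditer to match and report the position of each match, but this takes a very long time. Each string (a DNA sequence) consists of only six characters (ATGCN-). Only 17 strings were processed in 11 hours. What can I do to speed up the process? The part of the code I am talking about is: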

for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list)
all_positions_set = all_positions_set.union(positions_set)
count += 1
print(str(count) + '\t' +record.id+'\t'+'processed')
output_file.write(record.id+'\t'+str(positions_list)+'\n')

I also tried re.compile, which a Google search suggested would improve performance, but nothing changed (match = re.compile('[-N]')).

Answer 1

If you have roughly 300k matches, you are re-creating an ever-larger set that contains exactly the same elements as the list you are already appending to:

for found in re.finditer(r"[-N]", DNA_sequence):
    position = found.start() + 1
    positions_list.append(position)
    positions_set = set(positions_list) # 300k times ... why? why at all?

Instead, you can simply use the list of positions and add it to all_positions_set once, after all matches have been found.

all_positions_set = all_positions_set.union(positions_list) # union takes any iterable

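That should cut memory use by more than 50% (sets are more expensive than lists) and also reduce the runtime considerably, since the set is no longer rebuilt from the whole list after every match.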
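Putting the two snippets together, a minimal sketch of the corrected loop might look like this. The function name `collect_positions` and the sample `sequences` dictionary are assumptions for illustration; the original script iterates over records parsed from a file.

```python
import re

# Compile once outside the loop; the character class matches 'N' or '-'.
GAP_PATTERN = re.compile(r"[-N]")

def collect_positions(dna_sequence):
    """Return the 1-based positions of every 'N' or '-' in dna_sequence."""
    return [found.start() + 1 for found in GAP_PATTERN.finditer(dna_sequence)]

# Hypothetical stand-in for the records read from the input file.
sequences = {"seq1": "ATGCN-AT", "seq2": "NATG-CAT"}

all_positions_set = set()
for record_id, dna_sequence in sequences.items():
    positions_list = collect_positions(dna_sequence)
    # One set update per sequence instead of one set rebuild per match.
    all_positions_set.update(positions_list)
    print(record_id + '\t' + str(positions_list))
```

The key change is that the set is touched once per sequence, so the work grows linearly with the number of matches rather than quadratically.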

I am not sure which is faster, but you could even skip the regex altogether:

t = "ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-ATGCN-"

pos = []
for idx,c in enumerate(t):
    if c in "N-":
        pos.append(idx)

print(pos)  # [4, 5, 10, 11, 16, 17, 22, 23, 28, 29, 34, 35, 40, 41, 46, 47]

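Here enumerate() on the string is used to find the positions instead of a regex. Note that these indices are 0-based, whereas the code in the question adds 1. You would need to test whether this is actually faster.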
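One way to run that test is with timeit on a synthetic sequence. This is only a sketch: the sequence length, the uniform random composition, and the helper names `with_regex` and `with_enumerate` are assumptions, and the timings will differ on real data.

```python
import random
import re
import timeit

random.seed(0)
# Synthetic sequence: assumed size and composition, not the real data.
sequence = "".join(random.choices("ATGCN-", k=300_000))

pattern = re.compile(r"[-N]")

def with_regex(seq):
    """0-based positions of 'N' or '-' found with a compiled regex."""
    return [m.start() for m in pattern.finditer(seq)]

def with_enumerate(seq):
    """0-based positions of 'N' or '-' found by scanning each character."""
    return [idx for idx, c in enumerate(seq) if c in "N-"]

# Both approaches must agree before their timings are worth comparing.
assert with_regex(sequence) == with_enumerate(sequence)

for func in (with_regex, with_enumerate):
    seconds = timeit.timeit(lambda: func(sequence), number=5)
    print(f"{func.__name__}: {seconds:.3f} s for 5 runs")
```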

Answer 2

Regarding not using a regex: I did exactly that, and the modified script now runs in under 45 seconds using a defined function

changeValue

So the new part of the code is:

changeValue(val) {
    this.food$ = val;
}
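For reference, a regex-free helper in Python along the lines suggested in Answer 1 might look like the sketch below. The name `find_positions` and its `targets` parameter are hypothetical, and this is not necessarily the function the poster wrote.

```python
def find_positions(sequence, targets="N-"):
    """Return 1-based positions of every character of `sequence` found in `targets`."""
    # enumerate(..., start=1) yields 1-based positions, matching found.start() + 1
    # in the question's code.
    return [pos for pos, char in enumerate(sequence, start=1) if char in targets]

positions_list = find_positions("ATGCN-ATGCN-")
print(positions_list)  # [5, 6, 11, 12]
```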

Speeding up regex finditer for a large dataset
