Question

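I have 2 csv files (well, one of them is .tab), both with 2 columns of numbers. My job is to go through each row of the first file and see whether it matches any row in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following: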

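Scan each row of the second file (two integers each), go to the position in a 2D array given by those two integers (if the integers are 2 and 3, I go to position [2,3]), and assign the value 1.
Go through each row of the first file, check whether the position given by that row's two integers holds the value 1 in the array, and print the corresponding output to a third csv file.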

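Unfortunately the csv files are very large, so I get "MemoryError:" immediately when I run this. What is an alternative for scanning large csv files?

I am using Jupyter Notebook. My code: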

import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639,524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint,tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint,tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['','']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R','R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()

Files: https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing

Answer 1

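Your dataset really isn't that large, but it is relatively sparse. The problem is that you aren't using a sparse structure to store the data. Just use a set of tuples to store the pairs you have seen; a lookup on the set is then O(1) on average.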
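For scale, the dense array in the question would need about 6639 × 524525 × 8 bytes, roughly 27.9 GB as float64, which explains the immediate MemoryError. Below is a minimal sketch of the set-based approach, assuming both files are tab-delimited with two integer columns, as the question's code suggests. The function names `load_pairs`, `mark_rows`, and `common_sites` are hypothetical, chosen for illustration only.

```python
import csv


def load_pairs(path):
    """Read a tab-delimited file of two integer columns into a set of tuples."""
    with open(path, newline="") as f:
        reader = csv.reader(f, dialect="excel-tab")
        # Skip blank lines; each remaining row becomes one (int, int) tuple.
        return {(int(row[0]), int(row[1])) for row in reader if row}


def mark_rows(rows, seen):
    """Yield ['', ''] for rows whose pair is in `seen`, otherwise ['R', 'R']."""
    for row in rows:
        key = (int(row[0]), int(row[1]))
        yield ["", ""] if key in seen else ["R", "R"]


def common_sites(lookup_path, query_path, out_path):
    """Write one output row per row of `query_path`, in the original order."""
    seen = load_pairs(lookup_path)
    with open(query_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, dialect="excel-tab")
        writer = csv.writer(dst, delimiter=",")
        writer.writerows(mark_rows((r for r in reader if r), seen))


# Example call, using the file names from the question:
# common_sites("SL05_AO_RO.tab", "covbreadth_common_sites.csv", "output_SL05.csv")
```

Only the pairs that actually occur in the lookup file are stored, so memory grows with the number of rows rather than with the range of the integer values.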

Answer 2


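import numpy as np

# Both files hold two whitespace/tab-separated integer columns.
f1 = np.loadtxt('SL05_AO_RO.tab', dtype=np.int64)
f2 = np.loadtxt('covbreadth_common_sites.csv', dtype=np.int64)

# Sort whole rows lexicographically (first column, then second column).
# arr.sort(axis=0) would sort each column independently and break up the pairs.
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]

# Merge-style scan; note that the output follows the sorted order of f1.
i, j = 0, 0
while i < f1.shape[0]:
    # Advance j until f2[j] is not smaller than f1[i].
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1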

Load the data into ndarrays to optimize memory usage
Sort the data
Find the matches in the sorted arrays

The total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
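As a variant of the same sort-and-search idea, the sketch below sorts only the lookup array and binary-searches each queried row, so the results stay in the original row order of the queried file. It assumes non-negative integers that fit in int64 and a non-empty lookup array; the function name `match_mask` and the key-packing trick are illustrative assumptions, not part of the answer above.

```python
import numpy as np


def match_mask(query, lookup):
    """Return a boolean array: True where a row of `query` also occurs in `lookup`.

    Both arguments are (n, 2) arrays of non-negative integers.
    """
    query = np.asarray(query, dtype=np.int64)
    lookup = np.asarray(lookup, dtype=np.int64)
    # Pack each (a, b) pair into a single int64 key. `base` is larger than
    # every second-column value, so distinct pairs map to distinct keys.
    base = int(max(query[:, 1].max(), lookup[:, 1].max())) + 1
    query_keys = query[:, 0] * base + query[:, 1]
    lookup_keys = np.sort(lookup[:, 0] * base + lookup[:, 1])
    # Binary-search every query key in the sorted lookup keys.
    pos = np.searchsorted(lookup_keys, query_keys)
    # Keys larger than all lookup keys get position == size; point them at
    # index 0, where the equality test below is guaranteed to fail.
    pos[pos == lookup_keys.size] = 0
    return lookup_keys[pos] == query_keys
```

With the mask in hand, the output can be produced in file order, for example `for hit in mask: print('' if hit else 'R,R')`. The cost is one sort of the lookup keys plus one binary search per queried row.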

Using set() does not reduce memory usage per unique entry, so I wouldn't recommend it for large datasets.
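One rough way to check the per-entry overhead is to compare a set of tuples against an int64 array that holds the same pairs. This is only a sketch: `set_bytes` is a hypothetical helper, it counts the set's hash table and the tuple objects but not the integer objects they reference, and the exact numbers depend on the Python version and platform.

```python
import sys

import numpy as np


def set_bytes(pairs):
    """Approximate size of a set of tuples: hash table plus each tuple object."""
    return sys.getsizeof(pairs) + sum(sys.getsizeof(p) for p in pairs)


# 100,000 synthetic pairs, stored both ways.
pairs = {(i, i + 1) for i in range(100_000)}
arr = np.array(sorted(pairs), dtype=np.int64)

print("set of tuples (approx. bytes):", set_bytes(pairs))
print("int64 ndarray (bytes):        ", arr.nbytes)
```

The array costs a fixed 16 bytes per pair, while the set pays for a tuple object and a hash-table slot per pair. Whether that difference matters depends on how many rows the files actually contain.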

How do I search a very large csv file?
