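I have a simple script that first reads a CSV table (95 MB, 672,343 rows) and builds 5 lists from it (chrs, type, name, start, end). It then opens another file (37 MB, 795,516 rows), reads each line and compares it against those lists; if everything matches, it writes a string to the output file. This takes a very long time.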

2 answers:

Answer 0 (score: 2)

The problem is that you iterate 672,343 × 795,516 = 534,859,613,988 times, which is a lot. You need a smarter solution.


This is starting to look a lot like a database. So if it is a database, maybe we should treat it like one. Python ships with sqlite3.


import sqlite3
import csv

# create an in-memory database
conn = sqlite3.connect(":memory:")

# create the tables
c = conn.cursor()
c.execute("""CREATE TABLE t1 (
    chr   TEXT,
    type  TEXT,
    name  TEXT,
    start INTEGER,
    end   INTEGER
);""")

# if you only have a few columns, just name them all,
# if you have a lot, maybe just put everything in one
# column as a string
c.execute("""CREATE TABLE t2 (
    chr TEXT,
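    num INTEGER,
    -- placeholder names for the remaining input columns (assumed here;
    -- the INSERT below supplies four values per row)
    col3 TEXT,
    col4 TEXT
);""")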

# create indices on the columns we use for selecting
c.execute("""CREATE INDEX i1 ON t1 (chr, start, end);""")
c.execute("""CREATE INDEX i2 ON t2 (chr, num);""")

# fill the tables
with open("comparison_file.csv", newline="") as f:
    reader = csv.reader(f)
    # sqlite takes care of converting the number-strings to numbers
    c.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?, ?)", reader)

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    # sqlite takes care of converting the number-strings to numbers
    c.executemany("INSERT INTO t2 VALUES (?, ?, ?, ?)", reader)

# now let sqlite do its magic and select the correct lines
c.execute("""SELECT t2.*, t1.* FROM t1
             JOIN t2 ON t1.chr == t2.chr
             WHERE t2.num BETWEEN t1.start AND t1.end;""")

# write result to disk, one output row per matching (t2, t1) pair
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in c:
        writer.writerow(row)
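
To see the range join at work on a handful of rows, here is a minimal, self-contained sketch. The sample rows, the helper name range_join and the column names start_pos / end_pos are made up for illustration; they are not taken from the question's files.

```python
import sqlite3

def range_join(features, positions):
    """Return (chr, num, name) for every position that lies inside a feature.

    features:  iterable of (chr, name, start_pos, end_pos)  -- hypothetical layout
    positions: iterable of (chr, num)                       -- hypothetical layout
    """
    conn = sqlite3.connect(":memory:")
    c = conn.cursor()
    c.execute("CREATE TABLE t1 (chr TEXT, name TEXT, start_pos INTEGER, end_pos INTEGER)")
    c.execute("CREATE TABLE t2 (chr TEXT, num INTEGER)")
    # index the columns used by the join and the range test
    c.execute("CREATE INDEX i1 ON t1 (chr, start_pos, end_pos)")
    c.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", features)
    c.executemany("INSERT INTO t2 VALUES (?, ?)", positions)
    c.execute("""SELECT t2.chr, t2.num, t1.name
                 FROM t1 JOIN t2 ON t1.chr = t2.chr
                 WHERE t2.num BETWEEN t1.start_pos AND t1.end_pos
                 ORDER BY t2.chr, t2.num, t1.name""")
    rows = c.fetchall()
    conn.close()
    return rows

features = [
    ("chr1", "geneA", 100, 200),
    ("chr1", "geneB", 150, 300),
    ("chr2", "geneC", 100, 200),
]
positions = [("chr1", 120), ("chr1", 175), ("chr2", 50), ("chr3", 150)]

for row in range_join(features, positions):
    print(row)
```

Position 175 on chr1 falls inside both geneA and geneB, so it appears twice in the result; positions with no enclosing feature produce no rows at all.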



import csv

# used to be chrs[], type[], name[], start[], end[]
comparisons = []
with open("comparison_file.csv", newline="") as f:
    reader = csv.reader(f)
    for chr, type, name, start, end in reader:
        comparisons.append([chr, type, name, int(start), int(end)])

with open("output.csv", "w", newline="") as out_file, \
     open("input.csv", newline="") as in_file:
    writer = csv.writer(out_file)
    reader = csv.reader(in_file)

    for line in reader:
        num = int(line[1])  # convert once per input line, not once per comparison
        for comp in comparisons:
            chr, _, _, start, end = comp
            # same test as the SQL query: matching chr and start <= num <= end
            if line[0] == chr and start <= num <= end:
                writer.writerow(comp + line)


Answer 1 (score: 2)


Instead of converting it into 5 lists, make it a dict of lists of tuples, with chr as the key:

import csv
import collections
import bisect

# Use a defaultdict so we don't have to worry about whether a chr already exists
foobars = collections.defaultdict(list)

with open('file1.csv', newline='') as csvfile:
    rdr = csv.reader(csvfile)
    for (chrs, typ, name, start, end) in rdr:
        foobars[chrs].append((int(start), int(end), typ, name))


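Then sort each list in foobars (which you should rename to something appropriate for your task). This sorts by the start value first, since we put it first in the tuple:

for lst in foobars.values():
    lst.sort()

Then process the input lines: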

for line in inputFile:
    line = line.rstrip('\n')
    arr = line.split('\t')
    arr1int = int(arr[1])
    # Since we rearranged our data, we only have to check one of our sublists
    search = foobars[arr[0]]
    # We use bisect to quickly find the first item where the start value
    # is higher than arr[1]
    highest = bisect.bisect(search, (arr1int + 1,))
    # Now we have a much smaller number of records to check, and we've 
    # already ensured that chr is a match, and arr[1] >= start
    for (start, end, typ, name) in search[:highest]:
        if arr1int <= end:
            outputFile.write('\t'.join((arr[0], typ, str(start), str(end), name, line)) + '\n')



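The bisect line deserves some extra explanation. If you have a sorted sequence of values, bisect can be used to find the position at which a new value would be inserted into the sequence. We use it here to find the first value in the list where start is greater than arr[1] (take a moment to think about how these concepts are related). The odd-looking (arr1int + 1,) value simply makes sure we include all entries where start == arr[1], and turns the number into a tuple so that we compare like values.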
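
The effect of the + 1 is easiest to see on a tiny example. The following is a minimal sketch with made-up records; build_index and find_matches are hypothetical helper names that wrap the steps above, not part of the original answer.

```python
import bisect
import collections

def build_index(records):
    """Group (chr, type, name, start, end) records by chr, sorted by start."""
    index = collections.defaultdict(list)
    for chrom, typ, name, start, end in records:
        index[chrom].append((start, end, typ, name))
    for lst in index.values():
        lst.sort()
    return index

def find_matches(index, chrom, pos):
    """Return all (start, end, typ, name) on chrom with start <= pos <= end."""
    search = index.get(chrom, [])
    # index of the first tuple whose start is greater than pos
    highest = bisect.bisect(search, (pos + 1,))
    return [rec for rec in search[:highest] if pos <= rec[1]]

# made-up sample data
records = [
    ("chr1", "gene", "A", 100, 200),
    ("chr1", "gene", "B", 150, 300),
    ("chr1", "exon", "C", 175, 180),
    ("chr2", "gene", "D", 100, 200),
]
index = build_index(records)

# (175,) sorts before (175, 180, 'exon', 'C'), so the entry starting at 175 is cut off
print(bisect.bisect(index["chr1"], (175,)))   # 2
# (176,) sorts after every tuple whose start is 175, so that entry is kept
print(bisect.bisect(index["chr1"], (176,)))   # 3
print(find_matches(index, "chr1", 175))
```

A one-element tuple compares as smaller than any longer tuple that begins with the same value, which is why searching for (pos,) alone would drop records whose start equals pos.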

