Improving the efficiency of matching in Python

Question

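Suppose we have the following input, and we want to keep rows whose "APP_ID" column (the 4th column) is the same and whose "Category" column (the 18th column) values are one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".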

A, APPID, C, APP_ID, D, E, F, G, H, I, J, K, L, M, O, P, Q, Category, S, T
  ,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
  ,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
  ,,, APP-2 ,,,,,,,,,,,,,,,,
  ,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
  ,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,

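The desired output would be

A, APPID, C, APP_ID, D, E, F, G, H, I, J, K, L, M, O, P, Q, Category, S, T
  ,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
  ,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
  ,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
  ,,, APP-3 ,,,,,,,,,,,,,, Cell ,,

"APP-1" is kept because its two rows have the same APP_ID (4th column) and their categories are one "Cell" and one "Enzyme". The same goes for "APP-3", which has one "Cell" and one "Biochemical" in its "Category" column.

The attempt below solves the problem: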

import os

App=["1"]

for a in App:
    outname="App_"+a+"_target_overlap.csv"
    out=open(outname,'w')
    ticker=0
    cell_comp_id=[]
    final_comp_id=[]

    # make compound with cell activity (to a target) list first

    filename="App_"+a+"_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open (filename)
        line=file.readlines()
        if(ticker==0): # Deal with the title
            out.write(line[0])
            ticker=ticker+1

            for c in line[1:]:
                c=c.split(',')
                if(c[17]==" Cell "):
                     cell_comp_id.append(c[3])
    else:
        cell_comp_id=list(set(cell_comp_id))

# while we have list of compounds with cell activity, now we search the Bio and Enz and make one final compound list

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for comp in cell_comp_id:
                if (c[3]==comp and c[17]==" Biochemical "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif (c[3]==comp and c[17]==" Enzyme "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
    else:
        final_comp_id=list(set(final_comp_id))

# After we obtain a final compound list in target a, we go through the csv again to output the cell data

    filename="App_"+a+"_target_Detail_average.csv"

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for final in final_comp_id:
                if (c[3]==final and c[17]==" Cell "):
                    out.write(str(temporary_line))

    out.close()

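When the input file is small (tens of thousands of lines), this script finishes in a reasonable time. However, the input files grow to millions or even billions of lines, and then the script takes forever to finish (days...). I think the problem is that we first build a list of the APPIDs that have "Cell" in column 18. Then we go back and compare this "Cell" list (perhaps half a million entries) against the whole file (say, one million lines): if an APPID in the Cell list matches an APPID in the file, and column 18 of that row is "Enzyme" or "Biochemical", we keep the information. This step seems very time consuming.

I am wondering whether preparing "Cell", "Enzyme" and "Biochemical" dictionaries and comparing them would be faster. Does any guru have a better way to handle this? Any example or comment would be helpful. Thanks.

We use Python 2.7.6.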

Answer 1

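Reading the file efficiently

One big problem is that you read the file all in one go using readlines. That requires loading the whole file into memory at once. I doubt you have that much memory available.

Try: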

with open(filename) as fh:
    out.write(fh.readline()) # ticker
    for line in fh: #iterate through lines 'lazily', reading as you go.
        c = line.split(',')

Code in that style is a good starting point and should help a lot. Here it is in context:

# make compound with cell activity (to a target) list first

if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline()) # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])

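The with open(...) as syntax is a very common Python idiom that automatically takes care of closing the file when you leave the with block, or when an error occurs. Very useful.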
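One detail worth noting in the snippets above: they compare against " Cell " including its surrounding spaces, so the match depends on the exact whitespace in the file. Here is a minimal sketch of a lazy row reader that strips the fields first. The names iter_rows, ID_COL and CATEGORY_COL are made up for this illustration (they are not in the original script), the column positions are assumed from the question, and the code is written for Python 3.

```python
ID_COL = 3         # APP_ID column (position assumed from the question)
CATEGORY_COL = 17  # Category column (position assumed from the question)

def iter_rows(lines):
    """Lazily yield (app_id, category, raw_line) for each data line.

    `lines` can be any iterable of strings, including an open file,
    so only one line needs to be held in memory at a time.
    """
    for line in lines:
        cols = line.split(',')
        if len(cols) <= CATEGORY_COL:
            continue  # skip blank or truncated lines
        yield cols[ID_COL].strip(), cols[CATEGORY_COL].strip(), line
```

With a real file you would read the header with fh.readline() first and then pass fh itself as lines; the category can then be compared against "Cell" without worrying about padding.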

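Sets

The next thing, as you suggest, is to make better use of sets.

You don't need to re-create the set every time; you can just update it to add items. Here is some example set code (written in Python interpreter style: a leading >>> marks a line that you type in, so don't actually type the >>> part!):

>>> my_set = set()
>>> my_set
set()
>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])
>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])
>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])
>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])

So you can add items to a set, and remove them from it, without creating a new set from scratch (which is what you do each time you run the cell_comp_id=list(set(cell_comp_id)) bit).

You can also get differences, intersections, and so on:

>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])

>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])

See the docs for more details.

So let's try something like:

cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)

Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:

cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals

You can then go through the file again and simply check whether row_id (or c[3]) is in either of those sets, and if so, write the row out.

You can actually go further and combine those two sets:

cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals

This is the set of cells that have enzymes or biochemicals.

Then, when you go through the file the second time, you can do:

if row_id in cells_with_enz_or_bio:
    out.write(line)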
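Putting the set-based steps together, a two-pass version could look roughly like the sketch below. The function names ids_to_keep and matching_lines and the PARTNERS constant are invented for this example, the fields are stripped before comparing (an assumption about how tolerant you want to be with whitespace), and it is written for Python 3. Enzyme and Biochemical IDs share one set here, since cells & (enzymes | biochemicals) gives the same result as the union of the two intersections.

```python
PARTNERS = frozenset(['Biochemical', 'Enzyme'])

def ids_to_keep(lines, id_col=3, cat_col=17):
    """First pass: IDs that have a Cell row and a Biochemical or Enzyme row."""
    cells, partners = set(), set()
    for line in lines:
        cols = line.split(',')
        if len(cols) <= max(id_col, cat_col):
            continue  # skip blank or truncated lines
        row_id = cols[id_col].strip()
        category = cols[cat_col].strip()
        if category == 'Cell':
            cells.add(row_id)
        elif category in PARTNERS:
            partners.add(row_id)
    return cells & partners

def matching_lines(lines, keep, id_col=3):
    """Second pass: yield only the lines whose ID is in `keep`."""
    for line in lines:
        cols = line.split(',')
        if len(cols) > id_col and cols[id_col].strip() in keep:
            yield line
```

With a real file, open it once per pass, copy the header across with readline(), and pass the file object as lines. Each membership test against the set takes roughly constant time, so the work grows with the number of lines instead of with the number of lines times the number of Cell IDs.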

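After all that?

Just using these suggestions might be enough to get you by. However, you are still storing the entire sets of cells, biochemicals and enzymes in memory, and you are still running through the file twice.

So there are two ways we could potentially speed this up while still staying with a single Python process. I don't know how much memory you have available. If you are running out of memory, that may be slowing things down.

Reducing the sets as we go.

If you really do have a million records, and 800,000 of them are pairs (a cell record and a biochemical record), then by the time you get to the end of the list you are storing 800,000 IDs in sets. To reduce memory usage, once we have established that we want to output a record, we can save that information (that we want to print the record) to a file on disk and stop storing it in memory. Then we can read that list back later to figure out which records to print.

Since this increases disk IO, it could be slower. But if you are running out of memory, it could reduce swapping and so end up faster. It's hard to tell.

with open('to_output.tmp','a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...
        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            # discard() does not raise if the ID is not in the set
            cells.discard(row_id)
            biochemicals.discard(row_id)
            enzymes.discard(row_id)

Once you have read through the whole file, you have a file (to_output.tmp) which contains all the IDs that you want to keep. So you can read that back into Python:

with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))

This means that on your second run through the file you can simply say:

if row_id in ids_to_keep:
    out.write(line)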
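For what it's worth, here is a rough, self-contained sketch of the "reduce the sets as we go" idea, assuming Python 3. The function name spill_paired_ids is made up, and unlike the snippet above it writes one ID per line instead of a comma-separated list (my assumption, so that reading the file back does not produce an empty entry after a trailing comma). Any writable text file object works as id_file; the demo lines use io.StringIO just to stay in memory.

```python
import io

PARTNERS = ('Biochemical', 'Enzyme')

def spill_paired_ids(lines, id_file, id_col=3, cat_col=17):
    """Write each ID to id_file as soon as it has both a Cell row and a
    Biochemical/Enzyme row, then stop tracking that ID in memory."""
    cells, partners = set(), set()
    for line in lines:
        cols = line.split(',')
        if len(cols) <= max(id_col, cat_col):
            continue  # skip blank or truncated lines
        row_id = cols[id_col].strip()
        category = cols[cat_col].strip()
        if category == 'Cell':
            cells.add(row_id)
        elif category in PARTNERS:
            partners.add(row_id)
        if row_id in cells and row_id in partners:
            id_file.write(row_id + '\n')
            # discard() is safe even if the ID is missing from a set
            cells.discard(row_id)
            partners.discard(row_id)

# Reading the IDs back for the second pass (in-memory buffer for the demo):
buf = io.StringIO()
spill_paired_ids([], buf)
ids_to_keep = set(buf.getvalue().splitlines())
```

An ID can be written more than once if its rows keep pairing up later in the file, which is harmless because the IDs are loaded back into a set.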

Using dict instead of sets:

If you have plenty of memory, you could bypass all of that and use dicts to store the data, which lets you run through the file only once, rather than using sets at all.

cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line

        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])

The problem with this method is that it will get confused if any rows are duplicated.

If you are sure that the input records are unique, and that each ID has either an enzyme record or a biochemical record but not both, then you could easily delete the entries for that row_id from the dicts once you have written the rows out, which would reduce memory usage.
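If duplicated rows are a real possibility, one way around the confusion is to keep a list of lines per ID and only decide what to output at the end. This is a sketch under that assumption, not a drop-in replacement: paired_rows is a name I made up, it assumes Python 3, and it holds every Cell, Biochemical and Enzyme line in memory, so it only makes sense when the file fits comfortably in RAM.

```python
from collections import defaultdict

PARTNERS = ('Biochemical', 'Enzyme')

def paired_rows(lines, id_col=3, cat_col=17):
    """Single pass: return the lines of every ID that has at least one
    Cell row and at least one Biochemical or Enzyme row."""
    cell_rows = defaultdict(list)
    partner_rows = defaultdict(list)
    for line in lines:
        cols = line.split(',')
        if len(cols) <= max(id_col, cat_col):
            continue  # skip blank or truncated lines
        row_id = cols[id_col].strip()
        category = cols[cat_col].strip()
        if category == 'Cell':
            cell_rows[row_id].append(line)
        elif category in PARTNERS:
            partner_rows[row_id].append(line)

    result = []
    for row_id, rows in cell_rows.items():
        if row_id in partner_rows:  # membership test does not create a key
            result.extend(rows)
            result.extend(partner_rows[row_id])
    return result
```

Because every line is written exactly once at the end, a repeated row shows up in the output as many times as it appears in the input, and an ID with both an Enzyme and a Biochemical row is handled the same way as any other.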

I hope this helps :-)

Answer 2

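One technique I use to process massive files quickly in Python is to use the multiprocessing library to split the file into large chunks and process those chunks in parallel in worker subprocesses.

Here is the general algorithm:

Based on the amount of memory available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.
Pass the file name and the chunk start/end positions to subprocesses, which each open the file, read in and process their section of it, and return their results.

Specifically, I like to use a multiprocessing pool, create a list of chunk start/stop positions, and then use the pool.map() function. This blocks until every worker has finished, and the results from each subprocess are available if you capture the return value of the map call.

For example, you could do something like this in your subprocesses: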

import operator

# assume we have passed in a byte position to start and end at, and a file name:

def process_chunk(fi_name, chunk_start, chunk_end):
    with open(fi_name, 'r') as fi:
        fi.seek(chunk_start) # chunk_start is assumed to fall on a line boundary
        chunk = fi.readlines(chunk_end - chunk_start)

    retriever = operator.itemgetter(3, 17) # extracts only the elements we want
    APPIDs = {}

    for line in chunk:

        # split on commas and strip the padding around each value
        ID, category = [col.strip() for col in retriever(line.split(','))]
        try:
            APPIDs[ID].append(category) # we've seen this ID before, add category to its list
        except KeyError:
            APPIDs[ID] = [category] # we haven't seen this ID before - make an entry

    # APPIDs entries will look like this:
    #
    # <APPID> : [list of categories]

    return APPIDs

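In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this: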

for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
        print(ID) # or whatever you need to do with it
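To make the chunking and merging steps a bit more concrete, here is a hedged end-to-end sketch. Everything named here (chunk_offsets, process_chunk, paired_ids, the n_chunks and workers parameters) is invented for the example. It assumes Python 3, a UTF-8 file with one header row, and the column positions from the question. The file is opened in binary mode so that byte offsets from tell() and seek() are reliable.

```python
import multiprocessing
import os

ID_COL, CATEGORY_COL = 3, 17  # column positions assumed from the question
PARTNERS = frozenset(['Biochemical', 'Enzyme'])

def chunk_offsets(path, n_chunks):
    """Split the data lines into (start, end) byte ranges on line boundaries."""
    size = os.path.getsize(path)
    bounds = []
    with open(path, 'rb') as fh:
        fh.readline()                      # skip the header row
        start = fh.tell()
        step = max(1, (size - start) // max(1, n_chunks))
        while start < size:
            fh.seek(min(start + step, size))
            fh.readline()                  # run on to the end of the current line
            end = fh.tell()
            bounds.append((start, end))
            start = end
    return bounds

def process_chunk(args):
    """Worker: map each ID in one byte range to the set of categories seen."""
    path, start, end = args
    seen = {}
    with open(path, 'rb') as fh:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline()
            if not line:
                break
            cols = line.decode('utf-8').split(',')
            if len(cols) > CATEGORY_COL:
                app_id = cols[ID_COL].strip()
                seen.setdefault(app_id, set()).add(cols[CATEGORY_COL].strip())
    return seen

def paired_ids(path, n_chunks=4, workers=1):
    """Merge the per-chunk results and return the IDs worth keeping."""
    chunks = [(path, s, e) for s, e in chunk_offsets(path, n_chunks)]
    if workers > 1:
        pool = multiprocessing.Pool(processes=workers)
        try:
            results = pool.map(process_chunk, chunks)  # blocks until all are done
        finally:
            pool.close()
            pool.join()
    else:
        results = [process_chunk(c) for c in chunks]   # serial fallback
    merged = {}
    for partial in results:
        for app_id, cats in partial.items():
            merged.setdefault(app_id, set()).update(cats)
    return set(i for i, cats in merged.items()
               if 'Cell' in cats and cats & PARTNERS)
```

Merging the sets of categories in the main process is what resolves IDs whose rows land in different chunks. On platforms that start workers by spawning a fresh interpreter, the call with workers > 1 should sit under an if __name__ == '__main__': guard.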

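Some notes/caveats:

Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvement from this. You can try implementing the same algorithm with threading instead.
If you get heavy hard disk load that is NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you allow in the pool. This results in fewer read requests to the drive while still taking advantage of truly parallel processing.
Look for patterns in your input data that you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can do all of your comparisons in the subprocesses and let your main process hang out until it is time to combine the subprocess data structures.

TL;DR

Break your file up into chunks and process them in parallel with the multiprocessing library.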
