Here is my function:
def loadData(self, datasetName, target_indx=None):
    tokens = []
    labels = []
    ids = []
    allLabels = set()
    with open(self.dirPathName + datasetName + ".csv", mode="r", encoding="UTF-8") as f:
        for line in f:
            id, t, l = self.processLine(line)
            tokens.append(t)
            labels.append(l)
            ids.append(id)
            allLabels = allLabels.union(set(l))
    labelsList = list(allLabels)
    labelsList.sort()
    # print ("getting targets")
    targets = self.getTargets(labelsList, labels)
    # print ("finished getting targets")
    docs = []
    # print "read: " + str(len(targets))
    for i in range(len(tokens)):
        doc = self.make_document(ids[i], tokens[i], labels[i], targets[i], target_indx=target_indx)
        docs.append(doc)
    return docs
The CSV file that the code above opens for processing is 700 MB and contains roughly 130,000 lines.
My problem is with the highlighted section of the code (the final for loop). len(tokens) can be as large as 75,000, so the loop executes a very large number of times. On each iteration, self.make_document does some processing and the resulting object is appended to the list named docs, as shown in the highlighted part of the image. This is what is freezing up my memory.
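To illustrate the accumulation I am describing, here is a minimal generator-based sketch of the same loop (an illustration only, not code I have run; it assumes the callers could consume documents one at a time instead of indexing into a full docs list):

def iter_documents(self, datasetName, target_indx=None):
    # Same parsing as loadData, but each Document is yielded as soon as it is
    # built, so the full docs list never has to sit in memory at once.
    tokens, labels, ids = [], [], []
    allLabels = set()
    with open(self.dirPathName + datasetName + ".csv", mode="r", encoding="UTF-8") as f:
        for line in f:
            id, t, l = self.processLine(line)
            tokens.append(t)
            labels.append(l)
            ids.append(id)
            allLabels = allLabels.union(set(l))
    labelsList = sorted(allLabels)
    targets = self.getTargets(labelsList, labels)
    for i in range(len(tokens)):
        yield self.make_document(ids[i], tokens[i], labels[i], targets[i], target_indx=target_indx)

Note that this only avoids holding the Document objects themselves; the raw tokens, labels, and targets are still kept in memory.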
Here is the make_document code:
def make_document(self, docid, tokens, labels, target, target_indx=None):
    """Return Document object initialized with given token texts."""
    # Here tokens is the tokenized version of one line of text, so for the first line it is ['Angiosperm', 'is', 'a'].
    # labels holds the labels assigned to that document, e.g. ['7', '71'].
    # target has one entry per unique label (37 in our case). It is a numpy array of all zeros
    # except at the positions where '7' and '71' are located in the unique labels dictionary.
    # This gives data in the following format: [Token('Angiosperm'), Token('is'), Token('a')]
    tokens = [Token(t) for t in tokens]
    # We don't have sentence splitting, but the data structure expects
    # Documents to contain Sentences which in turn contain Tokens.
    # Create a dummy sentence containing all document tokens to work
    # around this constraint.
    # Create a sentence object
    sentences = [Sentence(tokens=tokens)]
    # Create a document object
    doc = Document(id=docid, target_idx=target_indx, target_str=str(labels), sentences=sentences)
    doc.set_target(target)
    return doc
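For clarity, a call to make_document with the example values from the comments above would look roughly like this (the label positions and the `loader` object are assumptions for illustration; Token, Sentence, and Document come from my own codebase):

import numpy as np

# Hypothetical example mirroring the comments above (values are illustrative, not real data):
tokens = ['Angiosperm', 'is', 'a']
labels = ['7', '71']
target = np.zeros(37)      # one slot per unique label
target[[6, 30]] = 1.0      # assumed positions of '7' and '71' in the sorted label list

# `loader` stands for whatever object owns loadData/make_document in my code.
doc = loader.make_document('some-id', tokens, labels, target)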
As you can see, this code calls the Document class and returns an object. A snapshot of the object looks like this:
I know that having a lot of RAM doesn't help if the code is inefficient, and that is what is happening here. Can someone give me some advice?