Here is my function:
def loadData(self, datasetName, target_indx=None):
    tokens = []
    labels = []
    ids = []
    allLabels = set()
    with open(self.dirPathName + datasetName + ".csv", mode="r", encoding="UTF-8") as f:
        for line in f:
            id, t, l = self.processLine(line)
            tokens.append(t)
            labels.append(l)
            ids.append(id)
            allLabels = allLabels.union(set(l))
    labelsList = list(allLabels)
    labelsList.sort()
    # print ("getting targets")
    targets = self.getTargets(labelsList, labels)
    # print ("finished getting targets")
    docs = []
    # print "read: " + str(len(targets))
    for i in range(len(tokens)):
        doc = self.make_document(ids[i], tokens[i], labels[i], targets[i], target_indx=target_indx)
        docs.append(doc)
    return docs
The CSV file that the code above opens for processing is 700 MB and contains roughly 130,000 lines.
My problem is with the highlighted section of the code (the final for loop). len(tokens) can be as large as 75,000, so the loop executes a very large number of times. On each iteration, self.make_document does some processing and the resulting object is appended to the list named docs, as shown in the highlighted part of the image. This is what is freezing up my memory.
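To illustrate the accumulation I am describing, here is a minimal generator-based sketch of the same loop (an illustration only, not code I have run; it assumes the callers could consume documents one at a time instead of indexing into a full docs list):

def iter_documents(self, datasetName, target_indx=None):
    # Same parsing as loadData, but each Document is yielded as soon as it is
    # built, so the full docs list never has to sit in memory at once.
    tokens, labels, ids = [], [], []
    allLabels = set()
    with open(self.dirPathName + datasetName + ".csv", mode="r", encoding="UTF-8") as f:
        for line in f:
            id, t, l = self.processLine(line)
            tokens.append(t)
            labels.append(l)
            ids.append(id)
            allLabels = allLabels.union(set(l))
    labelsList = sorted(allLabels)
    targets = self.getTargets(labelsList, labels)
    for i in range(len(tokens)):
        yield self.make_document(ids[i], tokens[i], labels[i], targets[i], target_indx=target_indx)

Note that this only avoids holding the Document objects themselves; the raw tokens, labels, and targets are still kept in memory.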
Here is the make_document code:
def make_document(self, docid, tokens, labels, target, target_indx=None):
    """Return Document object initialized with given token texts."""
    # Here tokens is the tokenized version of one line of text, so for the first line it is ['Angiosperm', 'is', 'a'].
    # labels holds the labels assigned to that document, e.g. ['7', '71'].
    # target has one entry per unique label (37 in our case). It is a numpy array of all zeros
    # except at the positions where '7' and '71' are located in the unique labels dictionary.
    # This gives data in the following format: [Token('Angiosperm'), Token('is'), Token('a')]
    tokens = [Token(t) for t in tokens]
    # We don't have sentence splitting, but the data structure expects
    # Documents to contain Sentences which in turn contain Tokens.
    # Create a dummy sentence containing all document tokens to work
    # around this constraint.
    # Create a sentence object
    sentences = [Sentence(tokens=tokens)]
    # Create a document object
    doc = Document(id=docid, target_idx=target_indx, target_str=str(labels), sentences=sentences)
    doc.set_target(target)
    return doc
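For clarity, a call to make_document with the example values from the comments above would look roughly like this (the label positions and the `loader` object are assumptions for illustration; Token, Sentence, and Document come from my own codebase):

import numpy as np

# Hypothetical example mirroring the comments above (values are illustrative, not real data):
tokens = ['Angiosperm', 'is', 'a']
labels = ['7', '71']
target = np.zeros(37)      # one slot per unique label
target[[6, 30]] = 1.0      # assumed positions of '7' and '71' in the sorted label list

# `loader` stands for whatever object owns loadData/make_document in my code.
doc = loader.make_document('some-id', tokens, labels, target)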
As you can see, this code calls the Document class and returns an object. A snapshot of the object looks like this:
I know that having a lot of RAM doesn't help if the code is inefficient, and that is what is happening here. Can someone give me some advice?