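I'm trying to read the file corpus.txt (the training set) and build a model. The output must be called lexic.txt and contain the word, the tag, and the number of occurrences. It works for small training sets, but for the training set given by my university (a 30 MB .txt file with millions of lines) the code does not work. I assume it is an efficiency problem and the system runs out of memory. Can anyone help me with the code?
Here is my code: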
from collections import Counter
file=open('corpus.txt','r')
data=file.readlines()
file.close()
palabras = []
count_list = []
for linea in data:
    linea.decode('latin_1').encode('UTF-8') # for the accents
    palabra_tag = linea.split('\n')
    palabras.append(palabra_tag[0])
cuenta = Counter(palabras) # dictionary that counts the occurrences of each word + tag
# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:
            count_list.append([palabras[i], str(cuenta[palabraTag])])
# Delete the repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)
outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')
for i in range(len(finalList)):
    # finalList[i][0] is the word + tag and finalList[i][1] is the number of occurrences
    outfile.write(finalList[i][0]+'\t'+finalList[i][1]+'\n')
outfile.close()
Here you can see a sample of corpus.txt:
Al Prep
menos Adv
cinco Det
reclusos Adj
murieron V
en Prep
las Det
últimas Adj
24 Num
horas NC
en Prep
las Det
cárceles NC
de Prep
Valencia NP
y Conj
Barcelona NP
en Prep
incidentes NC
en Prep
los Det
que Pron
su Det
Thanks in advance!
Answer 0 (score: 0)
You can reduce memory usage by combining these two pieces of code:
# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:
            count_list.append([palabras[i], str(cuenta[palabraTag])])
# Delete the repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)
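Instead of appending duplicates and removing them afterwards, check whether the item is already in count_list before appending it. That way the duplicates are never stored, which should reduce your memory usage. See below: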
# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        # Parentheses let the condition continue on the next line
        if (palabras[i] == palabraTag and
                [palabras[i], str(cuenta[palabraTag])] not in count_list):
            count_list.append([palabras[i], str(cuenta[palabraTag])])
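The idea of never storing duplicates can be taken one step further. The Counter built in the question already holds each word + tag exactly once, so the nested loops and the `not in count_list` check can be dropped entirely. The following is only a sketch of that variation, not the code from this answer; the helper name `unique_counts` and the short sample list are made up for illustration.

```python
from collections import Counter


def unique_counts(palabras):
    """Return [word_tag, count_as_string] pairs, one per distinct entry.

    Hypothetical helper: the name is not part of the original code.
    """
    cuenta = Counter(palabras)
    # The keys of a Counter are unique, so no duplicate check is needed
    # and the work is proportional to the number of lines, not its square.
    return [[palabra_tag, str(n)] for palabra_tag, n in cuenta.items()]


# Small made-up sample in the same "word tag" format as corpus.txt
palabras = ['en Prep', 'las Det', 'en Prep', 'horas NC', 'en Prep']
count_list = unique_counts(palabras)
```

Here `count_list` has the same shape as in the question (a list of `[word + tag, count]` pairs), so the code that writes lexic.txt could consume it unchanged.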
Answer 1 (score: 0)
In the end I improved the code by using a dictionary. Here is the result, which works 100%:
# Python 2 code
file=open('corpus.txt','r')
data=file.readlines()
file.close()
diccionario = {}
for linea in data:
    # Keep the converted string; the original call discarded its result
    linea = linea.decode('latin_1').encode('UTF-8') # for the accents
    palabra_tag = linea.split('\n')
    cadena = str(palabra_tag[0])
    # Count each word + tag with a single dictionary lookup
    if cadena in diccionario:
        diccionario[cadena] += 1
    else:
        diccionario[cadena] = 1
outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')
for key, value in diccionario.iteritems():
    s = str(value)
    outfile.write(key +" "+s+'\n')
outfile.close()
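For readers on Python 3, the same dictionary approach can be sketched as below. This is an adaptation under stated assumptions, not the code I ran: it assumes the corpus is Latin-1 encoded (as the `decode('latin_1')` call suggests), it skips blank lines, and it separates the count with a tab to match the header. The function name `build_lexicon` and its parameters are made up for illustration. It also iterates over the file object instead of calling `readlines()`, so only one line is held in memory at a time.

```python
from collections import Counter


def build_lexicon(corpus_path, lexicon_path, encoding='latin_1'):
    """Count each "word tag" line of corpus_path and write the totals.

    Hypothetical helper; the Latin-1 default is an assumption about the corpus.
    """
    cuenta = Counter()
    # Iterating over the file object yields one line at a time,
    # so the whole corpus is never loaded into a list.
    with open(corpus_path, 'r', encoding=encoding) as corpus:
        for linea in corpus:
            cadena = linea.rstrip('\n')
            if cadena:  # assumption: blank lines are not counted
                cuenta[cadena] += 1
    # Write the result as UTF-8 so the accents survive
    with open(lexicon_path, 'w', encoding='utf-8') as outfile:
        outfile.write('Palabra\tTag\tApariciones\n')
        for cadena, apariciones in cuenta.items():
            outfile.write(cadena + '\t' + str(apariciones) + '\n')
    return cuenta
```

It would be called as `build_lexicon('corpus.txt', 'lexic.txt')`. Memory use then grows with the number of distinct word + tag pairs rather than with the number of lines in the corpus.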