从大的.text文件生成模型读取语料库时出错

时间:2015-03-01 00:56:36

标签: python machine-learning nlp tagged-corpus

我试图读取文件corpus.txt(训练集)并生成模型,输出必须被称为lexic.txt并包含单词,标记和出现次数...对于小训练设置它的工作原理,但对于大学给出的训练集(30mb .txt文件,数百万行),代码不起作用,我想这将是效率的问题,因此系统耗尽内存...可以有人帮我提供代码吗?

我在这里附上我的代码:

from collections import Counter

file=open('corpus.txt','r')
data=file.readlines()
file.close()

palabras = []
count_list = []

for linea in data:
   linea.decode('latin_1').encode('UTF-8') # para los acentos
   palabra_tag = linea.split('\n')
   palabras.append(palabra_tag[0])

cuenta = Counter(palabras) # dictionary for count ocurrences for a word + tag 

#Assign for every word + tag the number of times appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:       
            count_list.append([palabras[i], str(cuenta[palabraTag])])


#We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)


outfile = open('lexic.txt', 'w') 
outfile.write('Palabra\tTag\tApariciones\n')

for i in range(len(finalList)):
    outfile.write(finalList[i][0]+'\t'+finalList[i][1]+'\n') # finalList[i][0] is the word + tag and finalList[i][1] is the numbr of ocurrences

outfile.close()

在这里你可以看到一个corpus.txt的样本:

Al  Prep
menos   Adv
cinco   Det
reclusos    Adj
murieron    V
en  Prep
las Det
últimas Adj
24  Num
horas   NC
en  Prep
las Det
cárceles    NC
de  Prep
Valencia    NP
y   Conj
Barcelona   NP
en  Prep
incidentes  NC
en  Prep
los Det
que Pron
su  Det

提前致谢!

2 个答案:

答案 0 :(得分:0)

如果将这两个代码组合在一起,就可以减少内存使用量。

#Assign for every word + tag the number of times appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:       
            count_list.append([palabras[i], str(cuenta[palabraTag])])


#We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i) 

您可以检查计数列表中是否存在某个项目,并通过这样做,而不是首先添加重复项。这应该会减少你的内存使用量。见下文;

#Assign for every word + tag the number of times appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag and
           [palabras[i], str(cuenta[palabraTag])] not in count_list:
                count_list.append([palabras[i], str(cuenta[palabraTag])])

答案 1 :(得分:0)

最后我使用字典改进了代码,这里的结果是100%正常工作:

file=open('corpus.txt','r')
data=file.readlines()
file.close()

diccionario = {}

for linea in data:
    linea.decode('latin_1').encode('UTF-8') # para los acentos
    palabra_tag = linea.split('\n')
    cadena = str(palabra_tag[0])
    if(diccionario.has_key(cadena)):
        aux = diccionario.get(cadena)
        aux += 1
        diccionario.update({cadena:aux})
    else:
        diccionario.update({cadena:1})

outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')

for key, value in diccionario.iteritems() :
    s = str(value)
    outfile.write(key +" "+s+'\n')
outfile.close()