I have a text file in the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
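... with about 1 million entries.
However, some word_forms contain an apostrophe (') and others do not, and I want to treat them as instances of the same word. In other words, I want to merge these two lines: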
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (with the frequencies added together):
cupboard cup blabla2 22
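I am looking for a way to do this in Python 2.7. My first idea was to read the text file, store the words with an apostrophe and the words without one in two separate dictionaries, then iterate over the dictionary of apostrophe words and test whether each word already exists in the dictionary of words without an apostrophe. If it does, add up the frequencies; if not, simply add the line with the apostrophe removed. Here is my code: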
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Keeps the lines of a file in memory for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe.
    Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory.'''
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'","")
                    items[2] = items[2].replace("\+Apos", "")
                    g.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0] : Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0] : Lemma(items)})
    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them by adding the count of their frequencies if there is a common key.
    Does not run in reasonable time on the whole list.'''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write( "%s\t%s\t%s\t%s\n" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                # no apostrophe-free entry exists yet, so add the stripped lemma under its key
                word_dict.update({word : word_dict_striped[word]})
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict

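This solution runs in a reasonable time up to the point where the two dictionaries are stored, whether I write them line by line to two text files to avoid holding anything in memory or keep them as dict objects in the program. But the merge of the two dictionaries never finishes!
The dictionary update method works, but it overwrites one frequency count instead of adding the two together. I have seen several solutions for merging dictionaries by summing values with Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
However, they only seem to work when the dictionary has the form (word, count), whereas I also want to carry the other fields along in the dictionary.
I am open to any ideas, including reframing the problem, since my goal is to run this program just once to get the merged list in a text file. Thanks in advance!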
Answer 0 (score: 0)
Here is something that does more or less what you want. Just change the file names at the top. It does not modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"
def custom_comp(s1, s2):
    # Compare two lines by their word form with apostrophes removed
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")
    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        # Same stripped word: put the apostrophe variant first
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    # Word form (first column) with apostrophes removed
    return line.split()[0].translate(None, "'")

def get_num(line):
    # Frequency (last column) as an integer
    return int(line.split()[-1])

print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)
print "File read and sorted"
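combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines):
    # Merge only when a following line exists and has the same stripped word;
    # checking i + 1 here also keeps the last line from being dropped
    if i + 1 < len(lines) and get_word(lines[i]) == get_word(lines[i+1]):
        total = get_num(lines[i]) + get_num(lines[i+1])
        # Keep the fields of the second line (the one without an apostrophe)
        new_parts = lines[i+1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1
print "Entries combined"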
print "Writing to file..."
with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")
print "Finished"
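It sorts the words and changes the spacing slightly (fields are written back separated by single spaces). If that matters, let me know and it can be adjusted.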
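If the original tab separators matter, one possible adjustment is to split and rejoin on the tab character rather than on arbitrary whitespace. The helper below is only a sketch: merge_pair is a hypothetical name that does not appear in the script above, and it assumes the four fields are tab-separated (as in the question's code) with the frequency in the last column.

```python
def merge_pair(apostrophe_line, plain_line):
    """Return plain_line with its frequency increased by that of apostrophe_line.

    Assumption: both lines are tab-separated and end with the frequency.
    The other fields of plain_line are kept unchanged.
    """
    first = apostrophe_line.rstrip("\r\n").split("\t")
    second = plain_line.rstrip("\r\n").split("\t")
    # Only the last column is summed; everything else comes from plain_line
    second[-1] = str(int(first[-1]) + int(second[-1]))
    return "\t".join(second)
```

In the combining loop, the result of merge_pair(lines[i], lines[i+1]) could be appended in place of the space-joined parts.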
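The other thing is that it sorts the entire file. With only a million lines that probably won't take too long, but again, let me know if it is a problem.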
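If sorting turns out to be a problem, a single pass with one dictionary keyed on the apostrophe-free word form avoids it and still carries the other fields along. The following is a minimal sketch, not a drop-in replacement: merge_apostrophe_variants is a hypothetical name, and it assumes tab-separated lines with exactly four fields and at most one analysis per word form. It does not strip the +Apos tag from the morphological field as the question's code attempts to do. Membership is tested directly on the dictionary, which is a hash lookup; in Python 2, testing against .keys() builds a list on every check and scans it, which is the likely reason the merge in the question never finishes.

```python
from collections import OrderedDict


def merge_apostrophe_variants(lines):
    """Merge entries whose word forms differ only by apostrophes.

    Assumption: each line is 'word_form<TAB>root<TAB>morph<TAB>frequency'.
    Returns merged lines (no trailing newline) in first-seen order.
    """
    merged = OrderedDict()  # stripped word form -> [word, root, morph, freq]
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        had_apostrophe = "'" in fields[0]
        key = fields[0].replace("'", "")
        freq = int(fields[3])
        entry = [key, fields[1], fields[2], freq]
        if key not in merged:  # lookup on the dict itself, not on .keys()
            merged[key] = entry
        else:
            total = merged[key][3] + freq
            if not had_apostrophe:
                # Prefer the fields of the apostrophe-free line,
                # as in the question's expected output
                merged[key] = entry
            merged[key][3] = total
    return ["\t".join(str(field) for field in entry) for entry in merged.values()]
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, and the returned lines can then be written to the output file with a newline appended to each.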