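I want to read a file and build a dictionary in which each word is a key and the words that follow it are the value.
For example, given a file that contains:
'Cake is cake okay.'
the resulting dictionary should contain:
{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
So far my code does the opposite: it updates each dictionary value with the previous word in the file. I'm not sure how to change it so that it works as intended.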
import string

def create_dict(file):
    word_dict = {}
    prev_word = ''
    for line in file:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)
            if clean_word not in word_dict:
                word_dict[clean_word] = []
            # Appends the *previous* word, the reverse of what is wanted
            word_dict[clean_word].append(prev_word)
            prev_word = clean_word
    return word_dict
Thanks in advance for your help!
Edit
Updated progress:
def create_dict(file):
    word_dict = {}
    next_word = ''
    for line in file:
        formatted_line = line.lower().split()
        for word in formatted_line:
            clean_word = word.strip(string.punctuation)
            if next_word != '':
                if next_word not in word_dict:
                    word_dict[next_word] = []
            if clean_word == '':
                clean_word.
            next_word = clean_word
    return word_dict
Answer 0 (score: 1)
You can use itertools.zip_longest() and dict.setdefault() for a shorter solution:
import io
from itertools import zip_longest  # izip_longest in Python 2
import string

def create_dict(fobj):
    word_dict = {}
    punc = string.punctuation
    for line in fobj:
        clean_words = [word.strip(punc) for word in line.lower().split()]
        for word, next_word in zip_longest(clean_words, clean_words[1:]):
            words = word_dict.setdefault(word, [])
            if next_word is not None:
                words.append(next_word)
    return word_dict
Test it:
>>> fobj = io.StringIO("""Cake is cake okay.""")
>>> create_dict(fobj)
{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
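For comparison, here is a minimal sketch that stays closer to the loop from the question: it remembers the previous word and appends the current word to that word's list. The name create_dict_across_lines is made up for illustration. Unlike the zip_longest() version above, which pairs words within each line, this sketch also links the last word of one line to the first word of the next; that is an assumption about the desired behavior, not something the question specifies.

```python
import io
import string


def create_dict_across_lines(fobj):
    """Map each word to the list of words that directly follow it."""
    word_dict = {}
    prev_word = None  # there is no previous word before the first one
    for line in fobj:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)
            if not clean_word:
                continue  # token was pure punctuation, e.g. "--"
            # Make sure every word is a key, even if nothing follows it.
            word_dict.setdefault(clean_word, [])
            if prev_word is not None:
                # The current word follows the previous one.
                word_dict[prev_word].append(clean_word)
            prev_word = clean_word
    return word_dict


print(create_dict_across_lines(io.StringIO("Cake is\ncake okay.")))
```

The only real change from the question's code is which key gets appended to: the previous word's list receives the current word, rather than the other way round.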
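Answer 1 (score: 0)
Separate the code that produces words from a given file (splitting on whitespace, case folding, stripping punctuation, etc.) from the code that builds the bigram dictionary (the subject of this question):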
#!/usr/bin/env python3
from collections import defaultdict
from itertools import tee

def create_bigram_dict(words):
    a, b = tee(words)  # itertools' pairwise recipe
    next(b)
    bigrams = defaultdict(list)
    for word, next_word in zip(a, b):
        bigrams[word].append(next_word)
    bigrams[next_word]  # last word may have no following words
    return bigrams
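See the itertools pairwise() recipe. To support files with fewer than two words, the code needs a small adjustment. If you need a plain dict rather than a defaultdict, you can return dict(bigrams) here instead. For example: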
>>> create_bigram_dict('cake is cake okay'.split())
defaultdict(<class 'list'>, {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []})
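The adjustment for inputs with fewer than two words is not spelled out in the answer. One possible sketch, borrowing zip_longest() from the other answer, is shown below. The function name create_bigram_dict_safe and the _MISSING sentinel are hypothetical and not part of the original code.

```python
from collections import defaultdict
from itertools import tee, zip_longest

_MISSING = object()  # sentinel marking "no following word"


def create_bigram_dict_safe(words):
    """Like create_bigram_dict, but tolerates zero or one input words."""
    a, b = tee(words)
    next(b, None)  # the default avoids StopIteration on empty input
    bigrams = defaultdict(list)
    for word, next_word in zip_longest(a, b, fillvalue=_MISSING):
        followers = bigrams[word]  # creates the key if it is new
        if next_word is not _MISSING:
            followers.append(next_word)
    return dict(bigrams)  # plain dict, as suggested above


print(create_bigram_dict_safe([]))
print(create_bigram_dict_safe(['cake']))
print(create_bigram_dict_safe('cake is cake okay'.split()))
```

Because zip_longest() pads the shorter iterator, the last word is still visited as a key, so the separate bigrams[next_word] line after the loop is no longer needed.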
To create the dict from a file, you can define get_words(file):
#!/usr/bin/env python3
import regex as re  # $ pip install regex

def get_words(file):
    with file:
        for line in file:
            words = line.casefold().split()
            for w in words:
                yield re.fullmatch(r'\p{P}*(.*?)\p{P}*', w).group(1)
Usage: create_bigram_dict(get_words(open('filename'))).
The \p{P} regex is used to strip Unicode punctuation. The code keeps punctuation that appears inside a word, for example:
>>> import regex as re
>>> re.fullmatch(r'\p{P}*(.*?)\p{P}*', "doesn't.").group(1)
"doesn't"
Note: the trailing dot is gone, but the ' is preserved. To remove all punctuation, you can use s = re.sub(r'\p{P}+', '', s):
>>> re.sub(r'\p{P}+', '', "doesn't.")
'doesnt'
Note: the single quote is gone too.
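If installing the third-party regex module is not an option, roughly the same two behaviors can be sketched with the standard library's unicodedata module, treating any character whose general category starts with 'P' as punctuation. The helper names is_punctuation, strip_punctuation, and remove_punctuation are made up for this illustration.

```python
import unicodedata


def is_punctuation(ch):
    """True if ch is in a Unicode punctuation category (Pc, Pd, Po, ...)."""
    return unicodedata.category(ch).startswith('P')


def strip_punctuation(word):
    """Remove punctuation from both ends but keep it inside the word."""
    start, end = 0, len(word)
    while start < end and is_punctuation(word[start]):
        start += 1
    while end > start and is_punctuation(word[end - 1]):
        end -= 1
    return word[start:end]


def remove_punctuation(word):
    """Remove every punctuation character, wherever it appears."""
    return ''.join(ch for ch in word if not is_punctuation(ch))


print(strip_punctuation("doesn't."))
print(remove_punctuation("doesn't."))
```

Note that, like \p{P}, this does not treat symbols such as $ or + as punctuation, whereas string.punctuation does include them.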