Question

I want to read a file and build a dictionary in which each word is a key and the words that follow it are the values.

For example, if I have a file containing:

'Cake is cake okay.'

the resulting dictionary should contain:

{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}

So far my code does the opposite: it updates the dictionary values with the previous word in the file. I'm not sure how to change it so that it works as intended.

def create_dict(file):

    word_dict = {}
    prev_word = ''

    for line in file:

        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)

            if clean_word not in word_dict:
                word_dict[clean_word] = []

            word_dict[clean_word].append(prev_word)
            prev_word = clean_word

Thanks in advance for your help!

Edit

Updated progress:

def create_dict(file):
    word_dict = {}
    next_word = ''

    for line in file:
        formatted_line = line.lower().split()

        for word in formatted_line:
            clean_word = word.strip(string.punctuation)

            if next_word != '':
                if next_word not in word_dict:
                    word_dict[next_word] = []

            if clean_word == '':
                clean_word.

            next_word = clean_word
    return word_dict

Answer 1

You can use itertools.zip_longest() and dict.setdefault() for a shorter solution:

import io
from itertools import zip_longest  # izip_longest in Python 2
import string

def create_dict(fobj):
    word_dict = {}
    punc = string.punctuation
    for line in fobj:
        clean_words = [word.strip(punc) for word in line.lower().split()]
        for word, next_word in zip_longest(clean_words, clean_words[1:]):
            words = word_dict.setdefault(word, [])
            if next_word is not None:
                words.append(next_word)
    return word_dict

Test it:

>>> fobj = io.StringIO("""Cake is cake okay.""")
>>> create_dict(fobj)
{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
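
For comparison, here is a minimal sketch that keeps the loop structure from the question and only changes which key gets appended to. The name create_dict_loop is hypothetical, and the sketch assumes that tokens consisting only of punctuation should be skipped. Each word is appended to the list of the word before it, instead of the previous word being appended to the current word's list:

```python
import io
import string

def create_dict_loop(fobj):
    # Hypothetical name; a sketch, not the only way to restructure the loop.
    word_dict = {}
    prev_word = None  # None means "no previous word seen yet"
    for line in fobj:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)
            if not clean_word:
                continue  # assumption: skip tokens that were only punctuation
            # Make sure every word is a key, even if nothing follows it.
            word_dict.setdefault(clean_word, [])
            # Record the current word as a follower of the previous one.
            if prev_word is not None:
                word_dict[prev_word].append(clean_word)
            prev_word = clean_word
    return word_dict

result = create_dict_loop(io.StringIO("Cake is cake okay."))
```

Because prev_word is carried across lines, the last word of one line is linked to the first word of the next line, whereas the zip_longest() version pairs words within each line only. Which behavior is wanted depends on the input.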

Answer 2

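Separate the code that generates words from a given file (splitting on whitespace, case folding, stripping punctuation, etc.) from the code that creates the bigram dictionary (the subject of this question):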

#!/usr/bin/env python3
from collections import defaultdict
from itertools import tee

def create_bigram_dict(words):
    a, b = tee(words) # itertools' pairwise recipe
    next(b)
    bigrams = defaultdict(list)
    for word, next_word in zip(a, b):  
        bigrams[word].append(next_word)
    bigrams[next_word] # last word may have no following words
    return bigrams

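See the itertools pairwise() recipe. To support files with fewer than two words, the code needs a small adjustment. If you need the exact type, you can use return dict(bigrams) here instead. For example: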

>>> create_bigram_dict('cake is cake okay'.split())
defaultdict(<class 'list'>, {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []})
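
One possible form of the small adjustment for inputs with fewer than two words is sketched below. The name create_bigram_dict_safe is hypothetical. Instead of tee() and next(), it remembers the previous word in a variable, so an empty or single-word input does not raise:

```python
from collections import defaultdict

def create_bigram_dict_safe(words):
    # Hypothetical variant; works for any iterable of words, including
    # empty and single-word inputs.
    bigrams = defaultdict(list)
    prev = None  # None means "no previous word yet"
    for word in words:
        if prev is not None:
            bigrams[prev].append(word)
        bigrams[word]  # touch the key so a final word still appears
        prev = word
    return dict(bigrams)  # plain dict, as suggested for the exact type

example = create_bigram_dict_safe('cake is cake okay'.split())
```

With this version, an empty input yields an empty dict and a single word yields that word mapped to an empty list.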

To create the dict from a file, you can define get_words(file):

#!/usr/bin/env python3
import regex as re  # $ pip install regex

def get_words(file):
    with file:
        for line in file:
            words = line.casefold().split()
            for w in words:
                yield re.fullmatch(r'\p{P}*(.*?)\p{P}*', w).group(1)

Usage: create_bigram_dict(get_words(open('filename'))).

To strip Unicode punctuation, the \p{P} regex is used. The code keeps punctuation inside a word, for example:

>>> import regex as re
>>> re.fullmatch(r'\p{P}*(.*?)\p{P}*', "doesn't.").group(1)
"doesn't"

Note: the trailing dot is gone, but the ' is kept. To remove all punctuation, you can use s = re.sub(r'\p{P}+', '', s):

>>> re.sub(r'\p{P}+', '', "doesn't.")
'doesnt'

Note: the single quote is gone as well.
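
If installing the third-party regex module is not an option, the edge stripping can be approximated with the standard library alone. The sketch below swaps in a different technique: it uses unicodedata.category() and treats any character whose category starts with "P" as punctuation, which is the set that \p{P} is meant to cover. The names is_punct and strip_unicode_punct are hypothetical:

```python
import unicodedata

def is_punct(ch):
    # Unicode general categories Pc, Pd, Ps, Pe, Pi, Pf, Po all start with "P".
    return unicodedata.category(ch).startswith("P")

def strip_unicode_punct(word):
    # Hypothetical helper: trim punctuation from both ends only,
    # leaving punctuation inside the word untouched.
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]

stripped = strip_unicode_punct("doesn't.")
```

As with the regex version, the trailing dot is removed and the inner apostrophe is kept. Symbols such as $ or + are not in a "P" category, so unlike string.punctuation they are left in place.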

Update dictionary values with the next word in the file?
