Word frequencies instead of ranks in Python

Time: 2018-10-21 15:38:33

Tags: python keras

I have the following code:

from keras.preprocessing import text

with open('engl_bible.txt', 'r') as file:
    norm_bible = file

    tokenizer = text.Tokenizer()
    tokenizer.fit_on_texts(norm_bible)

    word2id = tokenizer.word_index
    id2word = {v:k for k, v in word2id.items()}

    vocab_size = len(word2id) + 1
    embed_size = 100

    wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_bible]

    print('Vocabulary Size:', vocab_size)
    print('Vocabulary Sample:', list(word2id.items())[:10])

This produces the following output:

Vocabulary Size: 3847
Vocabulary Sample: [('and', 1), ('the', 2), ('to', 3), ('of', 4), ('you', 5), ('he', 6), ('in', 7), ('a', 8), ('is', 9), ('him', 10)]

But it should produce something like the following (that is, word frequencies rather than the ranks 1 to 10):

Vocabulary Size: 12425
Vocabulary Sample: [('perceived', 1460), ('flagon', 7287), ('gardener', 11641), ('named', 973), ('remain', 732), ('sticketh', 10622), ('abstinence', 11848), ('rufus', 8190), ('adversary', 2018), ('jehoiachin', 3189)]

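I really don't know what is going wrong and hope you can help me. Thanks a lot!

1 answer:

Answer 0 (score: 2)

If you want word frequencies, you need to use tokenizer.word_counts instead of tokenizer.word_index. The code then becomes: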

from keras.preprocessing import text

with open('engl_bible.txt', 'r') as file:
    # Read the lines into a list; a file object can only be iterated once.
    norm_bible = file.readlines()

    tokenizer = text.Tokenizer()
    tokenizer.fit_on_texts(norm_bible)

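    # word_counts maps each word to its frequency (word_index maps it to its rank).
    word2id = tokenizer.word_counts
    # The values are now counts, so this inverse mapping is no longer one-to-one.
    id2word = {v:k for k, v in word2id.items()}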

    vocab_size = len(word2id) + 1
    embed_size = 100

    # This lookup now yields word frequencies, not word ids.
    wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_bible]

    print('Vocabulary Size:', vocab_size)
    print('Vocabulary Sample:', list(word2id.items())[:10])

Note, though, that word2id is now no longer really word2id but rather word2frequency ...
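
To make the difference between the two mappings concrete, here is a minimal sketch in plain Python that mimics them without Keras. It is not the Keras implementation: the corpus `docs`, the whitespace tokenization, and the name `word2freq` are assumptions made for illustration only. It keeps the frequencies and the ids in two separate dictionaries, so the documents can still be encoded with ids while the frequencies are available for printing.

```python
from collections import Counter

# Hypothetical stand-in corpus; the real input would be the lines of engl_bible.txt.
docs = ["the lord is my shepherd", "the lord is good", "the shepherd"]

# Naive tokenization (assumption: lowercase and split on whitespace only).
tokens = [doc.lower().split() for doc in docs]

# Analogue of tokenizer.word_counts: word -> number of occurrences.
word2freq = Counter(w for doc in tokens for w in doc)

# Analogue of tokenizer.word_index: word -> rank, where 1 is the most frequent word.
word2id = {w: i for i, (w, _) in enumerate(word2freq.most_common(), start=1)}

# Encoding the documents still needs the ids, not the counts.
wids = [[word2id[w] for w in doc] for doc in tokens]

print('Vocabulary Size:', len(word2id) + 1)
print('Frequency Sample:', word2freq.most_common(3))
print('Id Sample:', list(word2id.items())[:3])
```

With a real Tokenizer, the same separation would amount to keeping tokenizer.word_index for encoding and reading tokenizer.word_counts only where frequencies are wanted.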