How do I find character bigrams and trigrams?

Time: 2018-04-24 14:46:53

Tags: python python-3.x machine-learning classification nltk

Question:

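I want to find the bigrams, trigrams, and bigram_score of a domain_name. I have a dataset of domain names, and I want to tell whether each one is a DGA domain or not using some simple classification. So I want to start with bigrams, trigrams, and entropy as features.

What I tried: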

from nltk import ngrams
sentence = 'some big sentence'
n = 2
# Word-level n-grams: the sentence is split on whitespace first.
word_ngrams = ngrams(sentence.split(), n)
for grams in word_ngrams:
    print(grams)

This gives me n-grams over the words of a sentence, but that is not what I am after.

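I want to convert a domain name into character n-grams, with each dot-separated part padded with '$'.

Example domain name: google.co.in

Bigrams:

['$g', 'go', 'oo', 'og', 'gl', 'le', 'e$', '$c', 'co', 'o$', '$i', 'in', 'n$']

Trigrams:

['$go', 'goo', 'oog', 'ogl', 'gle', 'le$', '$co', 'co$', '$in', 'in$']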

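Then I want to compute a bigrams_score. From that, I can use it in the prediction module and for analysis.

Can anyone help me understand how to approach this?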
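The question does not define bigrams_score or the entropy feature, so the following is only one possible reading, sketched under stated assumptions: entropy is taken as the Shannon entropy of the characters in the domain, and the bigram score as the average log-probability of the domain's padded bigrams under counts collected from a list of known benign domains. The function names, the add-alpha smoothing, and the vocab_size value are all hypothetical choices for illustration.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the characters in `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def padded_bigrams(domain, pad="$"):
    """Character bigrams of each dot-separated label, padded with `pad`."""
    grams = []
    for label in domain.lower().split("."):
        padded = pad + label + pad
        grams.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def train_bigram_counts(benign_domains):
    """Count bigrams over a reference list of benign domains (assumed given)."""
    counts = Counter()
    for domain in benign_domains:
        counts.update(padded_bigrams(domain))
    return counts


def bigram_score(domain, counts, alpha=1.0, vocab_size=38 * 38):
    """Average log-probability of the domain's bigrams (add-alpha smoothing).

    vocab_size is an assumption: 38 symbols (a-z, 0-9, '-', '$') squared.
    Higher (less negative) scores mean the bigrams look more like the
    reference domains.
    """
    grams = padded_bigrams(domain)
    if not grams:
        return 0.0
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[g] + alpha) / (total + alpha * vocab_size))
        for g in grams
    )
    return log_prob / len(grams)


# Toy reference list; a real one would be far larger.
counts = train_bigram_counts(["google.com", "facebook.com", "youtube.com"])
print(shannon_entropy("google"), bigram_score("google.co.in", counts))
```

Both values could then be used as numeric features alongside the n-gram counts in whatever simple classifier is chosen.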

1 answer:

Answer 0: (score: 2)

>>> from nltk import word_tokenize, ngrams
>>> s = "foo bar sentence"

# Word ngrams.
>>> list(ngrams(word_tokenize(s), 2))
[('foo', 'bar'), ('bar', 'sentence')]

# Character ngrams.
>>> list(ngrams(s, 2))
[('f', 'o'), ('o', 'o'), ('o', ' '), (' ', 'b'), ('b', 'a'), ('a', 'r'), ('r', ' '), (' ', 's'), ('s', 'e'), ('e', 'n'), ('n', 't'), ('t', 'e'), ('e', 'n'), ('n', 'c'), ('c', 'e')]
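The character n-grams above come back as tuples and are not padded, so they do not yet match the format asked for in the question. A minimal sketch of one way to get there, assuming each dot-separated label is padded with a single '$' on both sides (as the google.co.in example suggests) and using plain string slicing; char_ngrams is a hypothetical helper name, not part of NLTK:

```python
def char_ngrams(domain, n, pad="$"):
    """Return padded character n-grams for each dot-separated label."""
    grams = []
    for label in domain.lower().split("."):
        if not label:
            continue  # skip empty labels such as a trailing dot
        padded = pad + label + pad
        # Slide a window of width n over the padded label.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_ngrams("google.co.in", 2))
print(char_ngrams("google.co.in", 3))
```

Calling ngrams(padded, n) from NLTK on each padded label and joining every tuple with ''.join should give the same strings; slicing is used here only to keep the sketch dependency-free.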