I am building an NLP model in R to predict the next word. So, for a corpus of 3 sentences:
<.s> i like cheese <./s>
<.s> the dog like cat <./s>
<.s> the cat eat cheese <./s>
a bigram model gives:
p(i|<.s>)= 1/3
p(the|<.s>)=2/3
p(like|i)=1/1
p(cheese|like)=1/2
p(cat|like)=1/2
p(cat|the)=1/2
p(dog|the)=1/2
p(like|dog)=1/1
p(<./s>|cat)=1/2
p(eat|cat)=1/2
p(cheese|eat)=1/1
p(<./s>|cheese)=2/2
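
To make the target concrete, here is a minimal base-R sketch that reproduces the table above. It does not use `tm`. It assumes plain whitespace tokenisation and simply pastes the `<.s>` and `<./s>` markers around each sentence before counting; the variable names (`sentences`, `pairs`, `probs`) are only illustrative.

```r
# Toy corpus from the question.
sentences <- c("i like cheese", "the dog like cat", "the cat eat cheese")

# Wrap every sentence in the start/end markers, then split on whitespace.
tokens <- strsplit(paste("<.s>", sentences, "<./s>"), "\\s+")

# Build (previous word, next word) pairs within each sentence only,
# so that no bigram spans a sentence boundary.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Count the bigrams and normalise each row to get p(next | prev).
counts <- table(pairs$prev, pairs$nxt)
probs <- prop.table(counts, margin = 1)

# Distribution over the first word of a sentence.
probs["<.s>", ]

# Most likely first word under this model.
first_word <- names(which.max(probs["<.s>", ]))
```

With the markers counted as ordinary tokens, the row for `<.s>` is exactly the distribution over sentence-initial words, which is the behaviour I am after.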
So my model will predict:
the (2/3) cat (1/2) eat (1/2) cheese (1/1) <./s> (2/2)
or the (2/3) cat (1/2) <./s> (1/2)
or the (2/3) dog (1/2) like (1/1) cat (1/2) <./s> (1/2)
or the (2/3) dog (1/2) like (1/1) cat (1/2) eat (1/2) cheese (1/1) <./s> (2/2)
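or the (2/3) dog (1/2) like (1/1) cheese (1/2) <./s> (2/2)
This is what I want, because it correctly specifies that a sentence starts with "the" with probability 2/3.
So far, using `tm`, I can get the unigrams and bigrams needed to calculate these probabilities, but to start a sentence I can only pick whichever unigram is most frequent.
In my case: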
like = 2/11
cheese = 2/11
the = 2/11
cat = 2/11
dog = 1/11
eat = 1/11
i = 1/11
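
For reference, the unigram frequencies above can be checked with a short base-R sketch. It again assumes whitespace tokenisation and no sentence markers, and the variable names are illustrative.

```r
# Toy corpus from the question, without sentence markers.
sentences <- c("i like cheese", "the dog like cat", "the cat eat cheese")

# Flatten all sentences into one vector of words.
words <- unlist(strsplit(sentences, "\\s+"))

# Raw counts and relative frequencies (11 tokens in total).
unigram_counts <- table(words)
unigram_probs <- unigram_counts / length(words)

# Every word that shares the highest count.
top_words <- names(unigram_counts)[unigram_counts == max(unigram_counts)]
```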
This gives me a tie between "like", "cheese", "cat", and "the".
How can I introduce these sentence markers in R so that I get accurate predictions for the word at the beginning of a sentence?