Question

I am building an NLP model in R to predict the next word. For a corpus of 3 sentences:

<.s> I like cheese <./s>

<.s> The dog like cat <./s>

<.s> The cat eat cheese <./s>

For a bigram model:
p(i|<.s>)= 1/3
p(the|<.s>)=2/3

p(like|i)=1/1

p(cheese|like)=1/2
p(cat|like)=1/2

p(cat|the)=1/2
p(dog|the)=1/2

p(like|dog)=1/1

p(<./s>|cat)=1/2
p(eat|cat)=1/2

p(cheese|eat)=1/1

p(<./s>|cheese)=2/2
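
For reference, the table above can be reproduced in base R alone. This is only a minimal sketch, not the `tm` workflow: it assumes the sentences are already lower-cased and split on single spaces, and the variable names (`sentences`, `tokens`, `bigrams`, `bigram_prob`) are made up for illustration. The markers are pasted in as ordinary tokens before the bigrams are built.

```r
# Toy corpus from the question, lower-cased; one string per sentence.
sentences <- c("i like cheese", "the dog like cat", "the cat eat cheese")

# Pad each sentence with the boundary markers before building bigrams.
# The marker spellings "<.s>" and "<./s>" follow the question.
tokens <- lapply(strsplit(sentences, " ", fixed = TRUE),
                 function(words) c("<.s>", words, "<./s>"))

# Build bigrams within each sentence so that no pair spans two sentences.
bigrams <- do.call(rbind, lapply(tokens, function(tok) {
  data.frame(first = head(tok, -1), second = tail(tok, -1),
             stringsAsFactors = FALSE)
}))

# Count each bigram, then normalise every row by the count of its first
# word to get the maximum-likelihood estimate p(second | first).
counts <- table(bigrams$first, bigrams$second)
bigram_prob <- prop.table(counts, margin = 1)

bigram_prob["<.s>", "the"]    # 2/3, i.e. p(the|<.s>)
bigram_prob["cat", "<./s>"]   # 1/2, i.e. p(<./s>|cat)
```

Because the markers are just extra tokens, the row for `<.s>` is the distribution over sentence-initial words.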

So my model would predict:

    the (2/3) cat (1/2) eat (1/2) cheese (1/1) <./s> (2/2)
or  the (2/3) cat (1/2) <./s> (1/2)
or  the (2/3) dog (1/2) like (1/1) cat (1/2) <./s> (1/2)
or  the (2/3) dog (1/2) like (1/1) cat (1/2) eat (1/2) cheese (1/1) <./s> (2/2)
or  the (2/3) dog (1/2) like (1/1) cheese (1/2) <./s> (2/2)

This is what I want, because it correctly specifies that I start with "the" with probability 2/3.
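
To make the predicted paths concrete, each one can be scored by multiplying the bigram probabilities along it. The sketch below hard-codes the probabilities listed in the bigram model; `model` and `sentence_prob` are hypothetical names used only for this illustration, and unseen bigrams are simply given probability 0 (no smoothing).

```r
# Conditional probabilities copied from the bigram model in the question.
# Each list element is named by the preceding word.
model <- list(
  "<.s>" = c(i = 1/3, the = 2/3),
  i      = c(like = 1),
  like   = c(cheese = 1/2, cat = 1/2),
  the    = c(cat = 1/2, dog = 1/2),
  dog    = c(like = 1),
  cat    = c("<./s>" = 1/2, eat = 1/2),
  eat    = c(cheese = 1),
  cheese = c("<./s>" = 1)
)

# Probability of a whole sentence: the product of its bigram
# probabilities, including the start and end markers.
sentence_prob <- function(words, model) {
  tok <- c("<.s>", words, "<./s>")
  p <- 1
  for (k in seq_len(length(tok) - 1)) {
    step <- model[[tok[k]]][tok[k + 1]]
    if (is.null(step) || is.na(step)) return(0)  # unseen bigram
    p <- p * step
  }
  unname(p)
}

sentence_prob(c("the", "cat", "eat", "cheese"), model)  # 2/3 * 1/2 * 1/2 * 1 * 1 = 1/6
sentence_prob(c("the", "cat"), model)                   # 2/3 * 1/2 * 1/2 = 1/6
```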

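So far, using `tm`, I can get the unigrams and bigrams needed to compute these probabilities, but to start a sentence I can only compute which unigram is the most common.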

In my case:

like = 2/11
cheese = 2/11
the = 2/11
cat = 2/11
dog = 1/11
eat = 1/11
i = 1/11

This would give me a tie between "like", "cheese", "cat", or "the".
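
The difference between the two approaches can be checked with a few lines of base R. This is a rough sketch under the same assumptions as the toy corpus (lower-cased text, tokens separated by single spaces); the names `unigram_prob`, `first_words` and `start_prob` are illustrative only. It compares the overall unigram frequencies with the frequencies of the word in sentence-initial position, which is what p(w|<.s>) estimates.

```r
# Toy corpus from the question, lower-cased; one string per sentence.
sentences <- c("i like cheese", "the dog like cat", "the cat eat cheese")
token_list <- strsplit(sentences, " ", fixed = TRUE)

# Overall unigram relative frequencies (11 tokens in total).
unigram_prob <- sort(prop.table(table(unlist(token_list))), decreasing = TRUE)

# Words that share the highest unigram frequency: a four-way tie at 2/11.
tied <- names(unigram_prob)[unigram_prob == max(unigram_prob)]

# Relative frequencies of the first word of each sentence only.
first_words <- vapply(token_list, `[`, character(1), 1)
start_prob <- prop.table(table(first_words))

# The most likely sentence-initial word is no longer ambiguous.
best_start <- names(which.max(start_prob))   # "the"
```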

How can I introduce these sentence markers in R so that I get an accurate prediction for the word that starts a sentence?

Using R to tag sentence start <s> and end </s> for prediction

0 answers: