Marking sentence start <s> and end </s> in R for prediction

Time: 2016-03-17 20:19:35

Tags: r model nlp prediction

I am building an NLP model in R to predict the next word. So, for a corpus of 3 sentences:

<.s> i like cheese <./s>

<.s> the dog like cat <./s>

<.s> the cat eat cheese <./s>

for a bigram model:
p(i|<.s>)= 1/3
p(the|<.s>)=2/3

p(like|i)=1/1

p(cheese|like)=1/2
p(cat|like)=1/2

p(cat|the)=1/2
p(dog|the)=1/2

p(like|dog)=1/1

p(<./s>|cat)=1/2
p(eat|cat)=1/2

p(cheese|eat)=1/1

p(<./s>|cheese)=2/2
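
The table above can be reproduced with a short base R sketch. This is only an illustration under stated assumptions: it does not use `tm`, the sentences are typed in by hand in lower case, and the variable names (`sentences`, `tokens`, `pairs`, `bigram_probs`, `start_probs`) are made up for the example. The markers are added as ordinary tokens before the bigrams are counted.

```r
# Toy corpus from the question, lower-cased, one sentence per element.
sentences <- c("i like cheese", "the dog like cat", "the cat eat cheese")

# Wrap each sentence in the start/end markers and split on spaces.
# The marker spellings "<.s>" and "<./s>" follow the question.
tokens <- lapply(strsplit(sentences, " ", fixed = TRUE),
                 function(words) c("<.s>", words, "<./s>"))

# Build (previous, next) pairs inside each sentence, so that no bigram
# crosses a sentence boundary.
pairs <- do.call(rbind, lapply(tokens, function(tok) {
  data.frame(prev = head(tok, -1), nxt = tail(tok, -1),
             stringsAsFactors = FALSE)
}))

# Count the bigrams, then normalise each row to get p(next | previous).
bigram_counts <- table(pairs$prev, pairs$nxt)
bigram_probs  <- prop.table(bigram_counts, margin = 1)

# Distribution over the first word of a sentence: the row for "<.s>".
start_probs <- bigram_probs["<.s>", ]
print(start_probs[start_probs > 0])
```

Because the start marker is just another token, the row for `<.s>` gives the probability of each sentence-initial word (2/3 for "the", 1/3 for "i") without any special case.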

So my model would predict:

    the (2/3) cat (1/2) eat (1/2) cheese (1/1) <./s> (2/2)
or  the (2/3) cat (1/2) <./s> (1/2)
or  the (2/3) dog (1/2) like (1/1) cat (1/2) <./s> (1/2)
or  the (2/3) dog (1/2) like (1/1) cat (1/2) eat (1/2) cheese (1/1) <./s> (2/2)
or  the (2/3) dog (1/2) like (1/1) cheese (1/2) <./s> (2/2)

This is what I want, because it correctly specifies that I start with "the" with probability 2/3.
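
As a worked example of how such a sequence is scored, the probability of one full path is the product of its bigram probabilities. The sketch below hard-codes the five values from the table in the question for the path "the cat eat cheese"; the names `p`, `path`, `keys` and `path_prob` are illustrative only.

```r
# Conditional probabilities copied from the bigram table in the question,
# keyed as "next|previous".
p <- c("the|<.s>"     = 2/3,
       "cat|the"      = 1/2,
       "eat|cat"      = 1/2,
       "cheese|eat"   = 1,
       "<./s>|cheese" = 1)

# One predicted sentence, including both markers.
path <- c("<.s>", "the", "cat", "eat", "cheese", "<./s>")

# Build the "next|previous" key for every adjacent pair in the path.
keys <- paste(tail(path, -1), head(path, -1), sep = "|")

# Probability of the whole path: 2/3 * 1/2 * 1/2 * 1 * 1 = 1/6.
path_prob <- prod(p[keys])
print(path_prob)
```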

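So far, using `tm`, I can get the unigrams and bigrams needed to compute these probabilities, but to start a sentence I can only compute which unigram is the most common.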

In my case:

like = 2/11
cheese = 2/11
the = 2/11
cat = 2/11
dog = 1/11
eat = 1/11
i = 1/11

This would give me "like", "cheese", "cat", or "the".
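
The tie described above is easy to reproduce. The following base R sketch (again not using `tm`; the variable names are illustrative) counts unigrams over the same three sentences without any markers and lists the most frequent words.

```r
# Same toy corpus, without sentence markers.
sentences <- c("i like cheese", "the dog like cat", "the cat eat cheese")

# Flatten all sentences into one vector of words (11 tokens in total).
words <- unlist(strsplit(sentences, " ", fixed = TRUE))

# Unigram counts and relative frequencies.
unigram_counts <- table(words)
unigram_probs  <- prop.table(unigram_counts)

# Every word that shares the highest count; here four words tie at 2/11.
top_words <- names(unigram_counts)[unigram_counts == max(unigram_counts)]
print(top_words)
```

Unigram frequency ignores position, so it cannot separate "the" (which starts two of the three sentences) from "cheese" (which starts none). That is the information the start marker is meant to supply.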

How can I introduce these sentence markers in R so that the word at the start of a sentence is predicted accurately?

0 answers:

No answers