Prevent tm from removing stop words inside bigrams

Time: 2017-08-18 06:04:47

Tags: r tm corpus stop-words

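I am trying to remove stop words from a character vector. The problem is that "king" is one of my stop words, so the "king" in "king kong" is being removed as well.

Is there a way to avoid removing words that are part of a bigram? My code is: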

text <- VCorpus(VectorSource(newmnt1$form)) 
#(newmnt1$form is  chr [1:4] "king kong lives" "foot" "island" "skull")

#Normal standardization of text.
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, custom_stopwords)
text <- tm_map(text, stripWhitespace)
newmnt2 <- text[[1]]$content

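2 answers:

Answer 0: (score: 1)

A quick hack would be to convert your "king kong" pattern to "king_kong" before removing stop words. Because the underscore counts as a word character, removeWords no longer sees "king" as a separate word.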

a <- gsub("king kong", "king_kong", "This is a pattern with king and king kong")
a
[1] "This is a pattern with king and king_kong"

tm::removeWords(a, "king")
[1] "This is a pattern with  and king_kong"
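The same idea can be extended to several phrases and undone afterwards. The following is only a sketch: `protect_phrases`, `restore_phrases`, `keep`, and the one-word `custom_stopwords` are hypothetical names and values chosen for illustration, and the fixed-string matching assumes the protected phrases do not occur inside longer words.

```r
library(tm)

# Hypothetical helper: join each protected phrase with underscores so that
# removeWords() no longer treats its parts as separate words.
protect_phrases <- function(x, phrases) {
  for (p in phrases) {
    x <- gsub(p, gsub(" ", "_", p, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

# Hypothetical helper: turn the underscored phrases back into their original form.
restore_phrases <- function(x, phrases) {
  for (p in phrases) {
    x <- gsub(gsub(" ", "_", p, fixed = TRUE), p, x, fixed = TRUE)
  }
  x
}

docs <- c("king kong lives", "this is a pattern with king and king kong")
keep <- c("king kong")         # phrases to protect (assumed for this example)
custom_stopwords <- c("king")  # stand-in for the stop word list in the question

protected <- protect_phrases(docs, keep)
cleaned   <- removeWords(protected, custom_stopwords)
result    <- restore_phrases(cleaned, keep)
result    <- gsub("[[:space:]]+", " ", result)  # collapse the gaps left by removed words
print(result)
```

On a corpus, the two helpers could be wrapped in content_transformer() and applied with tm_map() before and after the removeWords step.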

Best,

Colin

Answer 1: (score: 0)

If you are willing to use another package, this works:

> text <- c("king kong lives", "foot", "island", "skull", "This is a pattern with king and king kong")
> corpus::term_matrix(text, drop = "king", combine = "king kong", transpose = TRUE)
11 x 5 sparse Matrix of class "dgCMatrix"

a         . . . . 1
and       . . . . 1
foot      . 1 . . .
is        . . . . 1
island    . . 1 . .
king kong 1 . . . 1
lives     1 . . . .
pattern   . . . . 1
skull     . . . 1 .
this      . . . . 1
with      . . . . 1

The combine argument tells corpus to interpret "king kong" as a single token, so dropping "king" leaves it intact.
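If adding a package is not an option, a similar effect can be approximated in base R with a negative lookahead. This is a different technique from the corpus approach, shown only as a sketch; the pattern is hard-coded for the "king" / "king kong" case and would have to be adapted for other stop words and phrases.

```r
docs <- c("king kong lives", "This is a pattern with king and king kong")

# (?!\\s+kong\\b) is a negative lookahead: "king" is matched only when it is
# NOT followed by whitespace and the word "kong". Lookaheads need perl = TRUE.
pattern <- "\\bking\\b(?!\\s+kong\\b)"

cleaned <- gsub(pattern, "", docs, perl = TRUE)
cleaned <- gsub("[[:space:]]+", " ", cleaned)  # collapse the gaps left behind
print(cleaned)
```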