在R中调用StemCompletion和PlainTextDocument损坏文本

时间:2016-07-26 18:05:24

标签: r data-manipulation

给定一个文本语料库,想要在R中使用tm (Text Mining) package进行词干和词干完成来规范化术语,但是,stemCompletion步骤在0.6.x版本的包中存在问题。使用R 3.3.1和tm 0.6-2。

之前已经问过这个问题,但还没有看到真正有效的完整答案。以下是正确演示该问题的完整代码。

(Day/Time/Duration/Cost)

Mo 1330 16 $6.40

Mo 815  35 $14.00

Tu 750  20 $3.00

We 1745 30 $12.00

Th 800  45 $18.00

Su 2350 30 $4.50

这是输出:

#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>

using namespace std;

const double DAYTIME = 0.40;
const double NIGHT = 0.25;
const double WEEKEND = 0.15;

int main()
{
    ifstream fin;
    fin.open("Call_History.txt");

    string day;
    int time;
    int duration;
    int dayOfWeek;
    double cost;
    double total;

    // Set the numeric output formatting.
    cout << fixed << showpoint << setprecision(2);

    cout << "Day Time Duration Cost\n" << endl;

    while (fin >> day >> time >> duration)
    {
        if (day == "Mo")
        {
            dayOfWeek = 1;
        }
        else if (day == "Tu")
        {
            dayOfWeek = 2;
        }
        else if (day == "We")
        {
            dayOfWeek = 3;
        }
        else if (day == "Th")
        {
            dayOfWeek = 4;
        }
        else if (day == "Fr")
        {
            dayOfWeek = 5;
        }
        else if (day == "Sa")
        {
            dayOfWeek = 6;
        }
        else if (day == "Su")
        {
            dayOfWeek = 7;
        }

        // Determine cost of call based on rate schedule.
        if ((time >= 800) && (time <= 1800) && (dayOfWeek <= 5))
        {
            cost = duration * DAYTIME;
        }
        else if ((time < 800) && (time > 1800) && (dayOfWeek <= 5))
        {
            cost = duration * NIGHT;
        }
        else
        {
            cost = duration * WEEKEND;
        }
        cout << day << " " << time << " " << duration << " $" << cost << endl;

    }

    cout << "\nTotal $" << endl;

    return 0;
}

有几个术语已被阻止:&#34; modifi&#34;,&#34;删除&#34;,&#34; subum&#34;,&#34;典型&#34;和&# 34; ONC&#34;

接下来,想要完成干预。

 require(tm)
 txt <- c("Once we have a corpus we typically want to modify the documents in it",
          "e.g., stemming, stopword removal, et cetera.",
          "In tm, all this functionality is subsumed into the concept of a transformation.")

 myCorpus <- Corpus(VectorSource(txt))

 myCorpus <- tm_map(myCorpus, content_transformer(tolower))
 myCorpus <- tm_map(myCorpus, removePunctuation)
 myCorpusCopy <- myCorpus

 # *Removing common word endings* (e.g., "ing", "es") 
 myCorpus <- tm_map(myCorpus, stemDocument, language = "english")

 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
 myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))

 tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
 print(tdm)
 print(dimnames(tdm)$Terms)

在此阶段,语料库不再是TextDocument,并且创建TermDocumentMatrix失败并显示错误:inherits(doc,&#34; TextDocument&#34;)不为TRUE。已记录下来应用<<TermDocumentMatrix (terms: 19, documents: 2)>> Non-/sparse entries: 20/18 Sparsity : 47% Maximal term length: 9 Weighting : term frequency (tf) [1] "all" "cetera" "concept" "corpus" "document" [6] "function" "have" "into" "modifi" "onc" [11] "remov" "stem" "stopword" "subsum" "the" [16] "this" "transform" "typic" "want" 函数。

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

这是输出:

PlainTextDocument()

调用myCorpus <- tm_map(myCorpus, PlainTextDocument) tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf))) print(tdm) print(dimnames(tdm)$Terms) 已损坏语料库。

期待完成词干:例如&#34; modifi&#34; =&GT; &#34;修饰符&#34;,&#34; onc&#34; =&GT; &#34;一次&#34;等等

2 个答案:

答案 0 :(得分:3)

致电PlainTextDocument并没有破坏语料库。

你可能已经注意到当你跑线

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

你收到了几条警告信息:

Warning messages:
1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

那些值得一提;)

这是如何使用您的数据进行干完成干预:

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
tdm      <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE)) 
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))
          stems       completed       
all       "all"       "all"           
cetera    "cetera"    "cetera"        
concept   "concept"   "concept"       
corpus    "corpus"    "corpus"        
document  "document"  "documents"     
function  "function"  "functionality" 
have      "have"      "have"          
into      "into"      "into"          
modifi    "modifi"    "modify"              
onc       "onc"       "once"          
remov     "remov"     "removal"       
stem      "stem"      "stemming"      
stopword  "stopword"  "stopword"      
subsum    "subsum"    "subsumed"      
the       "the"       "the"           
this      "this"      "this"          
transform "transform" "transformation"
typic     "typic"     "typically"     
want      "want"      "want"

要将更改永久写回TDM:

stemCompletion_mod <- function(x,dict=dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),
                                                         dictionary=dict, type="shortest"),sep="", 
                                          collapse=" ")))}

tdm <- stemCompletion_mod(rownames(tdm), myCorpus)  


tdm$content
  

[1]&#34;所有cetera概念语料库文档功能都已纳入NA   一旦删除干扰词,就会包含这种转变   通常需要&#34;

答案 1 :(得分:1)

关于Hack-R的解决方案,我和Jason有同样的问题,我希望在词云中使用“StemCompleted”字样,并作为TDM的一部分。

由于stemCompletion没有返回TDM,我从TDM中提取了“术语”,然后在那里运行了stemCompletion。

(我在测试时把这些变成了一个单独的变量)

require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
      "e.g., stemming, stopword removal, et cetera.",
      "In tm, all this functionality is subsumed into the concept of a transformation.")

myCorpus <- Corpus(VectorSource(txt))

myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus

 # *Removing common word endings* (e.g., "ing", "es") 
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")

 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

给出这个输出:

 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"     

由于stemCompletion似乎返回了一个字符表,我只是将'tdm'的术语部分替换为stemCompleted版本:

tdm$dimnames$Terms <- as.character(stemCompletion(tdm$dimnames$Terms, myCorpusCopy, type = "prevalent"))
print(tdm$dimnames$Terms)

这给了我:

 [1] "all"            "cetera"         "concept"        "corpus"        
 [5] "documents"      "functionality"  "have"           "into"          
 [9] ""               "once"           "removal"        "stemming"      
[13] "stopword"       "subsumed"       "the"            "this"          
[17] "transformation" "typically"      "want"          

显然,你会在不知道该怎么做的单词(“modifi”)上找到空白字段,但至少这次你可以使用stemCompleted版本......