Trying to match a list of words against a list of sentences in R

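Time: 2016-09-12 11:49:24

Tags: r sentiment-analysis sentimentr

I am trying to match a list of words against a list of sentences and build a data frame from the matching words and sentences. For example: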

words <- c("far better","good","great","sombre","happy")
sentences <- c("This document is far better","This is a great app","The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")

The expected result (a data frame) is as follows:

sentences                                               words
This document is far better                               better
This is a great app                                       great
The night skies were sombre and starless                  sombre 
The app is too good and i am happy using it               good, happy
This is how it works                                      -

I am using the following code to achieve this.

lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y>0]$x
neg.words <- polarity_table[polarity_table$y<0]$x
positiveWordsList <- list()
negativeWordsList <- list()
for(i in 1:lengthOfData){
        sentence <- sentence_df[i,]$comment
        #sentence <- gsub('[[:punct:]]', "", sentence)
        #sentence <- gsub('[[:cntrl:]]', "", sentence)
        #sentence <- gsub('\\d+', "", sentence)
        sentence <- tolower(sentence)
        # get  unigrams  from the sentence
        unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

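        # get bigrams from the sentence; seq_len() avoids the precedence trap in
        # 1:length(unigrams)-1, which evaluates as (1:n)-1 and starts at index 0
        bigrams <- unlist(lapply(seq_len(max(length(unigrams) - 1, 0)), function(i) {paste(unigrams[i],unigrams[i+1])} ))

        # .. and combine into a single vector
        words <- c(unigrams, bigrams)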
        #if(sentence_df[i,]$ave_sentiment)

        pos.matches <- match(words, pos.words)
        neg.matches <- match(words, neg.words)
        pos.matches <- na.omit(pos.matches)
        neg.matches <- na.omit(neg.matches)
        positiveList <- pos.words[pos.matches]
        negativeList <- neg.words[neg.matches]

        if(length(positiveList)==0){
          positiveList <- c("-")
        }
        if(length(negativeList)==0){
          negativeList <- c("-")
        }
        negativeWordsList[i]<- paste(as.character(unique(negativeList)), collapse=", ")
        positiveWordsList[i]<- paste(as.character(unique(positiveList)), collapse=", ")

        positiveWordsList[i] <- sapply(positiveWordsList[i], function(x) toString(x))
        negativeWordsList[i] <- sapply(negativeWordsList[i], function(x) toString(x))

    }    
positiveWordsList <- as.vector(unlist(positiveWordsList))
negativeWordsList <- as.vector(unlist(negativeWordsList))
scores.df <- data.frame(ave_sentiment=sentence_df$ave_sentiment, comment=sentence_df$comment,pos=positiveWordsList,neg=negativeWordsList, year=sentence_df$year,month=sentence_df$month,stringsAsFactors = FALSE)

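I have 28k sentences and 65k words to match against. The code above takes 45 seconds to complete the task. Are there any suggestions for improving its performance, given that the current approach takes a considerable amount of time?

Edit

I want to get only those words that exactly match a word in the sentence. For example: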

words <- c('sin','vice','crashes') 
sentences <- ('Since the app crashes frequently, I advice you guys to fix the issue ASAP')

For the case above, my output should be as follows:

sentences                                                           words
Since the app crashes frequently, I advice you guys to fix          crashes
the issue ASAP
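
To make the requirement concrete, here is a minimal base R sketch of the difference between a substring hit and a whole-word hit for the example above. The variable names `substring_hit` and `whole_word_hit` are illustrative only, and the `\b` word-boundary pattern assumes the entries in `words` contain no regex metacharacters.

```r
sentence <- "Since the app crashes frequently, I advice you guys to fix the issue ASAP"
words <- c("sin", "vice", "crashes")

# Substring search on the lower-cased sentence: "sin" is found inside "since"
# and "vice" inside "advice", so all three entries are reported.
substring_hit <- vapply(
  words,
  function(w) grepl(w, tolower(sentence), fixed = TRUE),
  logical(1)
)

# Whole-word search: \b anchors the pattern at word boundaries, so only
# "crashes" is reported.
whole_word_hit <- vapply(
  words,
  function(w) grepl(paste0("\\b", w, "\\b"), sentence, ignore.case = TRUE, perl = TRUE),
  logical(1)
)

words[substring_hit]   # "sin" "vice" "crashes"
words[whole_word_hit]  # "crashes"
```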

2 answers:

Answer 0 (score: 1)

I was able to use @David Arenburg's answer with some modifications. Here is what I did. I used the following (suggested by David) to form the data frame.

library(stringi)
df <- data.frame(sentences, stringsAsFactors = FALSE)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))

The problem with this approach is that it matches substrings rather than whole words. To filter out the words that do not exactly match a word in the sentence, I first expanded the data frame to one row per matched word.

# Assumption: `s` is not defined in the original answer; it is taken here to be
# the per-sentence list of matched words, split from df$words.
s <- strsplit(df$words, ", ", fixed = TRUE)
df <- data.frame(fil = unlist(s), text = rep(df$sentences, sapply(s, FUN = length)), stringsAsFactors = FALSE)

After applying the line above, the output data frame changes as follows.

sentences                                                      words
This document is far better                                    better
This is a great app                                            great
The night skies were sombre and starless                       sombre
The app is too good and i am happy using it                    good
The app is too good and i am happy using it                    happy
This is how it works                                           -
Since the app crashes frequently, I advice you guys to fix     crashes
the issue ASAP
Since the app crashes frequently, I advice you guys to fix     vice
the issue ASAP
Since the app crashes frequently, I advice you guys to fix     sin
the issue ASAP

Now apply the following filter to the data frame to remove the words that do not exactly match a word present in the sentence.

# x[1] is the candidate word (fil), x[2] is the sentence (text)
df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]

After this filter, only whole-word matches remain: for the last sentence, crashes is kept while vice and sin are dropped.

stri_detect_fixed reduced my computation time considerably. The rest of the process did not take much time. Thanks to @David for pointing me in the right direction.
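
For comparison, the expand-then-filter steps above can be collapsed into a single pass in base R. This is only a sketch under stated assumptions, not the method from this answer: `match_whole_words` is a hypothetical helper, punctuation is stripped before tokenising, and only single words and two-word phrases are looked up (which covers entries such as "far better").

```r
# Hypothetical helper: for each sentence, return the entries of `words` that
# occur as a whole token or as a pair of adjacent tokens, joined with ", ".
match_whole_words <- function(sentences, words) {
  vapply(sentences, function(s) {
    # Drop punctuation and lower-case, so "frequently," becomes "frequently"
    tokens <- strsplit(tolower(gsub("[[:punct:]]", "", s)), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    # Adjacent pairs, so that two-word entries can be matched as well
    bigrams <- if (length(tokens) > 1) {
      paste(head(tokens, -1), tail(tokens, -1))
    } else {
      character(0)
    }
    hits <- words[tolower(words) %in% c(tokens, bigrams)]
    if (length(hits) == 0) "-" else toString(hits)
  }, character(1), USE.NAMES = FALSE)
}

words <- c("far better", "good", "great", "sombre", "happy", "sin", "vice", "crashes")
sentences <- c(
  "This document is far better",
  "The app is too good and i am happy using it",
  "This is how it works",
  "Since the app crashes frequently, I advice you guys to fix the issue ASAP"
)

result <- data.frame(
  sentences = sentences,
  words = match_whole_words(sentences, words),
  stringsAsFactors = FALSE
)
```

Because the lookup uses `%in%` on tokens rather than a substring search, "sin" and "vice" are never candidates for the last sentence, so no separate filtering step is needed.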

Answer 1 (score: 0)

You can do this with extract_sentiment_terms in the latest version of sentimentr, but you must first build a sentiment key and assign a value to each word:

pos <- c("far better","good","great","sombre","happy")
neg <- c('sin','vice','crashes') 

sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP',
    "This document is far better", "This is a great app","The night skies were sombre and starless", 
    "The app is too good and i am happy using it", "This is how it works")

library(sentimentr)
(sentkey <- as_key(data.frame(c(pos, neg), c(rep(1, length(pos)), rep(-1, length(neg))), stringsAsFactors = FALSE)))

##             x  y
## 1:    crashes -1
## 2: far better  1
## 3:       good  1
## 4:      great  1
## 5:      happy  1
## 6:        sin -1
## 7:     sombre  1
## 8:       vice -1

extract_sentiment_terms(sentences, sentkey)

##    element_id sentence_id negative   positive
## 1:          1           1  crashes           
## 2:          2           1          far better
## 3:          3           1               great
## 4:          4           1              sombre
## 5:          5           1          good,happy
## 6:          6           1
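
To illustrate what the key contributes, here is a rough base R analogue that splits whole-word hits into positive and negative terms using the sign of `y`. It is a sketch only and does not reflect how sentimentr is implemented: `has_term` is a hypothetical helper, the key is a plain data frame rather than the output of `as_key`, and the `\b` pattern assumes the terms contain no regex metacharacters.

```r
pos <- c("far better", "good", "great", "sombre", "happy")
neg <- c("sin", "vice", "crashes")

# Plain data frame standing in for the sentiment key (x = term, y = polarity)
key <- data.frame(
  x = c(pos, neg),
  y = c(rep(1, length(pos)), rep(-1, length(neg))),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each term that occurs as a whole word or phrase
has_term <- function(sentence, terms) {
  vapply(terms, function(term) {
    grepl(paste0("\\b", term, "\\b"), sentence, ignore.case = TRUE, perl = TRUE)
  }, logical(1), USE.NAMES = FALSE)
}

# Split the matching terms by the sign of their polarity value
split_by_polarity <- function(sentence, key) {
  hit <- has_term(sentence, key$x)
  list(
    negative = key$x[hit & key$y < 0],
    positive = key$x[hit & key$y > 0]
  )
}

first <- split_by_polarity(
  "Since the app crashes frequently, I advice you guys to fix the issue ASAP", key
)
second <- split_by_polarity("The app is too good and i am happy using it", key)
```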