Counting the words in review text row by row in an R data frame

Date: 2016-06-26 12:48:45

Tags: r dplyr text-mining

I want to count the number of words in each row:

Review_ID   Review_Date   Review_Content   Listing_Title   Star   Hotel_Name
 1          1/25/2016     I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back!   Outstanding  5  Crosby Street Hotel
 2          1/18/2016     We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return!   Always perfect!   5   Crosby Street Hotel

I was thinking of something like:

WordFreqRowWise %>% 
rowwise() %>%
summarise(n = n())

to get a result like this:

Review_ID   Review_Content   total_Words   Min_occrd_word   Max      Average
   1            ....            230           great: 1      the: 25  total_unique/total_words in the row

But I have no idea how to go about it...

2 answers:

Answer 0: (score: 2)

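Here is a method in base R using strsplit and sapply. It assumes the data are stored in a data.frame named df and the reviews are stored in the variable Review_Content as character strings.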

# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")

# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)

In this case, sapply returns a vector holding the word count for each row.
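Splitting on a single space counts an empty string for every extra space between words, and strsplit rejects factor columns. A minimal sketch of a more tolerant variant follows. It assumes that a "word" is any run of non-whitespace characters, and the sample data frame is made up for illustration:

```r
# Hypothetical sample data, for illustration only
df <- data.frame(
  Review_ID = 1:3,
  Review_Content = c("Glad I did.", "Hope  to be   back!", ""),
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns; trimws() removes leading and
# trailing whitespace so that no empty first element is produced
words <- strsplit(trimws(as.character(df$Review_Content)), "[[:space:]]+")

# lengths() returns the length of each list element as an integer vector
df$wordCount <- lengths(words)
```

With this variant an empty review gets a count of 0 rather than being counted as one word.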

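Since the word counts are now stored in df$wordCount, you can run whatever analysis you need on them. Here are some examples:

  • Summary of the word count distribution: summary(df$wordCount)
  • Maximum word count: max(df$wordCount)
  • Mean word count: mean(df$wordCount)
  • Range of word counts: range(df$wordCount)
  • Interquartile range of word counts: IQR(df$wordCount)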
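The question also asks for per-row figures (total words, least and most frequent word, and the ratio of unique to total words). One possible sketch is shown here. The helper name row_stats and the one-line sample review are hypothetical, and the code assumes every row contains at least one word:

```r
# Hypothetical one-row example; `temp` mirrors the strsplit() output above
reviews <- c("the room and the staff and the view")
temp <- strsplit(reviews, split = " ")

# Summarise one vector of words into the columns the question asks for.
# Ties are resolved in favour of the alphabetically first word, because
# table() sorts its names and which.min()/which.max() return the first hit.
row_stats <- function(words) {
  freq <- table(words)
  data.frame(
    total_Words    = length(words),
    Min_occrd_word = paste0(names(freq)[which.min(freq)], ": ", min(freq)),
    Max            = paste0(names(freq)[which.max(freq)], ": ", max(freq)),
    Average        = length(freq) / length(words),  # unique words / total words
    stringsAsFactors = FALSE
  )
}

# One row of statistics per review, bound into a single data frame
stats <- do.call(rbind, lapply(temp, row_stats))
```

The resulting columns can be attached to the original data with cbind(df, stats), provided stats has one row per row of df.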

Answer 1: (score: 1)

Adding to @lmo's answer above:

The code below builds a data frame that lists every word together with its frequency, row by row:

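# start with an empty data frame and append one block of rows per review
temp2 <- data.frame()
for (i in seq_along(temp)) {
  # frequency of each word in row i (columns Var1 and Freq)
  temp1 <- as.data.frame(table(temp[[i]]))
  # label the rows with the review they came from
  temp1$ID <- paste0("Row_", i)
  temp2 <- rbind(temp2, temp1)
}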
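Growing a data frame with rbind inside a loop copies it on every iteration, which can become slow with many reviews. A loop-free sketch of the same idea is given here. The input list stands in for the strsplit() result temp, and the column name Word is an assumption chosen for readability:

```r
# Hypothetical input standing in for the strsplit() result `temp`
temp <- list(c("the", "room", "the"), c("great", "stay"))

# Build one frequency table per row, tag it with its row label, then bind once
freq_list <- lapply(seq_along(temp), function(i) {
  out <- as.data.frame(table(temp[[i]]), stringsAsFactors = FALSE)
  names(out) <- c("Word", "Freq")
  out$ID <- paste0("Row_", i)
  out
})
word_freq <- do.call(rbind, freq_list)
```

The result has one row per distinct word per review, so it can be filtered or aggregated by ID afterwards.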