How to ... in R

Time: 2015-10-14 03:35:13

Tags: r aggregate text-mining corpus

This is a car review dataset with more than 40,000 rows, and each review is more than 500 characters long. Here is the sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E

| brand  | review          | favorite        | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 |    |    |    |    |    |  
| brand2 | 500 characters2 | 100 Characters2 |    |    |    |    |    | 
| brand2 | 500 characters3 | 100 Characters3 |    |    |    |    |    |
| brand2 | 500 characters4 | 100 Characters4 |    |    |    |    |    | 
| brand3 | 500 characters5 | 100 Characters5 |    |    |    |    |    | 
| brand3 | 500 characters6 | 100 characters6 |    |    |    |    |    |

I want to merge the review column by brand, like this:

| Brand  | review          | favorite        | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 |    |    |    |    |    |  
| brand2 | 500 characters2 | 100 Characters2 |    |    |    |    |    | 
|        | 500 characters3 | 100 Characters3 |    |    |    |    |    |
|        | 500 characters4 | 100 Characters4 |    |    |    |    |    | 
| brand3 | 500 characters5 | 100 Characters5 |    |    |    |    |    | 
|        | 500 characters6 | 100 characters6 |    |    |    |    |    |

So I tried using aggregate().

temp <- aggregate(data$review ~ data$brand, data, as.list)

However, this takes a very long time.

Is there a simple way to do this merge? Thanks in advance!

1 answer:

Answer 0: (score: 0)

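Try splitting the text on the brand factor and then pasting the pieces of each group together. aggregate() is a very slow function and is best avoided on all but the smallest datasets.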

This should do the trick (note that I downloaded your Google file as sampleDF.csv):

sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)

# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")

# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")

# combine the grouped vectors into one data frame, one row per brand
newDf <- data.frame(brand = names(brand.split),
                    text = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)
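
Since the CSV file may not be at hand, here is a minimal self-contained sketch of the same split-then-paste approach on a toy data frame. The data and the names toyDF and toyResult are made up for illustration, and the column names (Brand, text, favorite) are assumed to match those used in the code above.

```r
# Toy data standing in for the real file (hypothetical values).
# Column names are assumed to match the answer's code: Brand, text, favorite.
toyDF <- data.frame(
  Brand    = c("brand1", "brand2", "brand2", "brand3"),
  text     = c("review a", "review b", "review c", "review d"),
  favorite = c("fav a", "fav b", "fav c", "fav d"),
  stringsAsFactors = FALSE
)

# split() groups the values by brand; paste(collapse = " ") joins each group
# into a single string, giving one named element per brand.
text.grouped     <- sapply(split(toyDF$text, toyDF$Brand), paste, collapse = " ")
favorite.grouped <- sapply(split(toyDF$favorite, toyDF$Brand), paste, collapse = " ")

# Assemble one row per brand; unname() drops the brand names from the vectors
# so they do not become row names.
toyResult <- data.frame(brand = names(text.grouped),
                        text = unname(text.grouped),
                        favorite = unname(favorite.grouped),
                        stringsAsFactors = FALSE)
print(toyResult)
```

The two reviews for brand2 end up in a single row, joined by a space.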

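If you want to bring in other variables, they need to vary only at the brand level (that is, each brand should have a single value for them).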
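
One possible way to carry such a brand-level variable along is sketched below. The country column is a hypothetical example and does not come from the original data. The sketch assumes each brand has exactly one value for it, and checks that assumption before merging.

```r
# Toy data with a hypothetical brand-level column `country`
# (constant within each brand); not part of the original dataset.
toyDF <- data.frame(
  Brand   = c("brand1", "brand2", "brand2", "brand3"),
  text    = c("review a", "review b", "review c", "review d"),
  country = c("DE", "JP", "JP", "US"),
  stringsAsFactors = FALSE
)

# Collapse the review text to one string per brand, as before.
text.grouped <- sapply(split(toyDF$text, toyDF$Brand), paste, collapse = " ")
grouped <- data.frame(Brand = names(text.grouped),
                      text = unname(text.grouped),
                      stringsAsFactors = FALSE)

# Reduce the brand-level columns to one row per brand.
brandInfo <- unique(toyDF[, c("Brand", "country")])

# Fail early if a brand has more than one value, i.e. the variable
# does not actually vary only at the brand level.
stopifnot(!anyDuplicated(brandInfo$Brand))

# Attach the brand-level variable to the grouped text.
merged <- merge(grouped, brandInfo, by = "Brand")
print(merged)
```

If a variable takes several values within one brand, the check fails, and that variable would need its own aggregation rule instead.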