This is a car review dataset with more than 40,000 rows, and each review is more than 500 characters long. Here is the sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I want to merge the review column by brand, like this:
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So I tried using aggregate():
temp <- aggregate(data$review ~ data$brand , data, as.list )
However, this takes a very long time.
Is there a simple way to do this merge? Thanks in advance!
Answer 0: (score: 0)
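Try splitting each column by brand and then pasting the pieces together. aggregate() is a very slow function and should be avoided on all but the smallest datasets.
This should do the trick (note that I downloaded your Google file as sampleDF.csv here):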
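# Column names (Brand, text, favorite) follow the downloaded sample file;
# adjust them if your data uses different names.
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand: split into one character vector per brand,
# then collapse each vector into a single string
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand in the same way
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
# one row per brand; use `=` (not `<-`) so the columns get the intended names
newDf <- data.frame(brand = names(brand.split),
                    text = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)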
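If you want to bring in other variables, they just need to vary only at the brand level (that is, take a single value per brand).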
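As a rough illustration of that last point, here is a minimal sketch on toy data. The `reviews` data frame and its `country` column are made up for this example and are not part of the sample file; the only assumption that matters is that `country` has exactly one value per brand.

```r
# Toy data standing in for the real reviews. `country` is a hypothetical
# brand-level variable: it takes a single value within each brand.
reviews <- data.frame(
  brand   = c("brand1", "brand2", "brand2", "brand3"),
  review  = c("r1", "r2", "r3", "r4"),
  country = c("DE", "JP", "JP", "US"),
  stringsAsFactors = FALSE
)

# Collapse the review text per brand, using the same split/paste approach.
review.grouped <- sapply(split(reviews$review, reviews$brand),
                         paste, collapse = " ")

# Keep one row per brand for the variables that are constant within a brand.
brand.level <- unique(reviews[, c("brand", "country")])

# Join the collapsed text back to the brand-level variables.
merged <- merge(
  data.frame(brand = names(review.grouped),
             review = unname(review.grouped),
             stringsAsFactors = FALSE),
  brand.level,
  by = "brand"
)
```

If a variable takes more than one value within a brand, `unique()` returns several rows for that brand and the merge duplicates the collapsed text, so it may be worth checking `nrow(brand.level)` against the number of brands first.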