I have the following data frame:
File, Paragraph, Sentence, Entity
article1.txt, 1, 1, USA
article1.txt, 1, 1, Canada
article1.txt, 1, 2, Toronto
article1.txt, 1, 2, New York
article2.txt, 1, 1, China
article2.txt, 1, 1, Japan
I can count the entities per file, paragraph, and sentence:
occurrences <- rep.int(1, nrow(entity.locations))
entity.locations <- cbind(entity.locations, occurrences)
aggregate(occurrences ~ File + Paragraph + Sentence,
          data = entity.locations[, c(1, 2, 3)], FUN = sum)
This gives me the result:
File, Paragraph, Sentence, occurrences
article1.txt, 1, 1, 2
article1.txt, 1, 2, 2
article2.txt, 1, 1, 2
Now I want to do the same thing when File contains NA values:
File, Paragraph, Sentence, Entity
article1.txt, 1, 1, USA
article1.txt, 1, 1, Canada
article1.txt, 1, 2, Toronto
article1.txt, 1, 2, New York
article2.txt, 1, 1, China
article2.txt, 1, 1, Japan
NA, 1, 1, Ted Cruz
NA, 1, 1, Trump
NA, 1, 1, Hillary
NA, 2, 1, Putin
The expected result is:
File, Paragraph, Sentence, occurrences
article1.txt, 1, 1, 2
article1.txt, 1, 2, 2
article2.txt, 1, 1, 2
NA, 1, 1, 3
NA, 2, 1, 1
How can I aggregate so that it works whether or not there are NA values? Is the na.action argument the solution?
Answer 0 (score: 2)
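Getting aggregate to work properly can be a bit frustrating. In this case, you need to make File a factor that includes NA among its levels, so that the NA rows form their own groups and show up in the counts: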
df$File <- factor(df$File, exclude = NULL)
df$occurrences <- 1
aggregate(occurrences ~ File + Paragraph + Sentence, data = df, FUN = sum)
# File Paragraph Sentence occurrences
# 1 article1.txt 1 1 2
# 2 article2.txt 1 1 2
# 3 <NA> 1 1 3
# 4 <NA> 2 1 1
# 5 article1.txt 1 2 2
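To see why the factor conversion matters, here is a minimal self-contained sketch. It rebuilds the example data under the name df (the answer above uses df without defining it, so the construction is my assumption about its shape) and compares the default behaviour with the NA-as-level version. The names without_na and with_na are only illustrative. By default the formula method of aggregate uses na.action = na.omit, so rows with a missing File are dropped before grouping; once NA is an explicit factor level, those entries no longer count as missing.

```r
# Rebuild the example data; File is a plain character column with real NAs.
df <- data.frame(
  File = c(rep("article1.txt", 4), rep("article2.txt", 2),
           rep(NA_character_, 4)),
  Paragraph = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2),
  Sentence  = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1),
  Entity = c("USA", "Canada", "Toronto", "New York", "China", "Japan",
             "Ted Cruz", "Trump", "Hillary", "Putin"),
  stringsAsFactors = FALSE
)
df$occurrences <- 1

# Default: the four rows with NA in File are removed before aggregation.
without_na <- aggregate(occurrences ~ File + Paragraph + Sentence,
                        data = df, FUN = sum)

# With NA as an explicit level, the NA rows are kept and grouped together.
df$File <- factor(df$File, exclude = NULL)
with_na <- aggregate(occurrences ~ File + Paragraph + Sentence,
                     data = df, FUN = sum)

nrow(without_na)  # 3 groups, NA rows missing
nrow(with_na)     # 5 groups, including the two NA groups
```

As far as I can tell from ?aggregate, rows with missing values in the grouping variables are omitted from the result, so changing na.action on its own is unlikely to be enough here; converting File to a factor is what keeps the NA groups.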
A common alternative for this kind of task is dplyr, which keeps NA as its own group:
library(dplyr)
df %>% group_by(File, Paragraph, Sentence) %>% summarise(occurrences = n())
# Source: local data frame [5 x 4]
# Groups: File, Paragraph [?]
#
# File Paragraph Sentence occurrences
# (fctr) (int) (int) (int)
# 1 article1.txt 1 1 2
# 2 article1.txt 1 2 2
# 3 article2.txt 1 1 2
# 4 NA 1 1 3
# 5 NA 2 1 1
and data.table:
library(data.table)
setDT(df)[, .(occurrences = .N), .(File, Paragraph, Sentence)]
# File Paragraph Sentence occurrences
# 1: article1.txt 1 1 2
# 2: article1.txt 1 2 2
# 3: article2.txt 1 1 2
# 4: NA 1 1 3
# 5: NA 2 1 1
Pick whichever you like best.