Question

I have the following data frame:

File, Paragraph, Sentence, Entity
article1.txt, 1, 1, USA
article1.txt, 1, 1, Canada
article1.txt, 1, 2, Toronto
article1.txt, 1, 2, New York
article2.txt, 1, 1, China
article2.txt, 1, 1, Japan

I can aggregate it by file:

occurrences <- rep.int(1, nrow(entity.locations))
entity.locations <- cbind(entity.locations, occurrences)

aggregate(occurrences ~ File + Paragraph + Sentence, data = entity.locations[, c(1, 2, 3)], FUN = sum)

which gives me this result:

File, Paragraph, Sentence, occurrences
article1.txt, 1, 1, 2
article1.txt, 1, 2, 2
article2.txt, 1, 1, 2

Now I want to do the same thing when File contains NA values:

File, Paragraph, Sentence, Entity
article1.txt, 1, 1, USA
article1.txt, 1, 1, Canada
article1.txt, 1, 2, Toronto
article1.txt, 1, 2, New York
article2.txt, 1, 1, China
article2.txt, 1, 1, Japan
NA, 1, 1, Ted Cruz
NA, 1, 1, Trump
NA, 1, 1, Hillary
NA, 2, 1, Putin

The expected result is:

File, Paragraph, Sentence, occurrences
article1.txt, 1, 1, 2
article1.txt, 1, 2, 2
article2.txt, 1, 1, 2
NA, 1, 1, 3
NA, 2, 1, 1

How can I aggregate so that it works whether or not there are NA values? Is the na.action argument the solution?

Answer 1

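Getting aggregate to work properly can be a bit frustrating. In this case, you need to turn File into a factor that includes NA among its levels, so that the NA rows are kept as groups of their own and show up in the counts: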

df$File <- factor(df$File, exclude = NULL)
df$occurrences <- 1
aggregate(occurrences ~ File + Paragraph + Sentence, data = df, FUN = sum)
#           File Paragraph Sentence occurrences
# 1 article1.txt         1        1           2
# 2 article2.txt         1        1           2
# 3         <NA>         1        1           3
# 4         <NA>         2        1           1
# 5 article1.txt         1        2           2
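
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the question's data and compares the default behaviour with the NA-level approach. The name df and the column types are assumptions (the original post does not show how the data was loaded), and addNA() is used here as a base-R equivalent of factor(x, exclude = NULL) rather than the exact call above.

```r
# Rebuild the data from the question; `df` and the column types are assumptions.
df <- data.frame(
  File      = c(rep("article1.txt", 4), rep("article2.txt", 2), rep(NA, 4)),
  Paragraph = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L),
  Sentence  = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L),
  Entity    = c("USA", "Canada", "Toronto", "New York", "China", "Japan",
                "Ted Cruz", "Trump", "Hillary", "Putin"),
  stringsAsFactors = FALSE
)
df$occurrences <- 1

# With the default na.action (na.omit), rows whose File is NA are dropped.
without_na <- aggregate(occurrences ~ File + Paragraph + Sentence,
                        data = df, FUN = sum)

# addNA() makes NA an explicit factor level, so those rows survive as groups.
df$File <- addNA(df$File)
with_na <- aggregate(occurrences ~ File + Paragraph + Sentence,
                     data = df, FUN = sum)
```

Comparing nrow(without_na) with nrow(with_na) shows whether the NA groups were kept.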

A common alternative for this kind of task is dplyr:

library(dplyr)
df %>% group_by(File, Paragraph, Sentence) %>% summarise(occurrences = n())
# Source: local data frame [5 x 4]
# Groups: File, Paragraph [?]
# 
#           File Paragraph Sentence occurrences
#         (fctr)     (int)    (int)       (int)
# 1 article1.txt         1        1           2
# 2 article1.txt         1        2           2
# 3 article2.txt         1        1           2
# 4           NA         1        1           3
# 5           NA         2        1           1

and data.table:

library(data.table)
setDT(df)[, .(occurrences = .N), .(File, Paragraph, Sentence)]
#            File Paragraph Sentence occurrences
# 1: article1.txt         1        1           2
# 2: article1.txt         1        2           2
# 3: article2.txt         1        1           2
# 4:           NA         1        1           3
# 5:           NA         2        1           1

Pick whichever you like best.

R - Aggregating while taking NA values into account