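I am still learning data management in R. I know I am very close, but I can't get the syntax exactly right. I have looked at count a variable by using a condition in R and Conditional count and group by in R, but can't quite translate them to my own work. I am trying to get a count, by ST, of the rows where dist.km equals 0. Eventually I will want to add columns with counts for different distance ranges, but I should be able to work that out once I have this. The final table should include every state, even those with a count of 0. Here is a 20-row sample.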
structure(list(ST = structure(c(12L, 15L, 13L, 10L, 15L, 16L,
11L, 12L, 8L, 14L, 10L, 14L, 6L, 11L, 5L, 5L, 15L, 1L, 6L, 4L
), .Label = c("CT", "DE", "FL", "GA", "MA", "MD", "ME", "NC",
"NH", "NJ", "NY", "PA", "RI", "SC", "VA", "VT", "WV"), class = "factor"),
Rfips = c(42107L, 51760L, 44001L, 34001L, 51061L, 50023L,
36029L, 42101L, 37019L, 45079L, 34029L, 45055L, 24003L, 36027L,
25009L, 25009L, 51760L, 9003L, 24027L, 1111L), zip = c(17972L,
23226L, 2806L, 8330L, 20118L, 5681L, 14072L, 19115L, 28451L,
29206L, 8741L, 29020L, 20776L, 12545L, 1922L, 1938L, 23226L,
6089L, 21042L, 36278L), Year = c(2010L, 2005L, 2010L, 2008L,
2007L, 2006L, 2005L, 2008L, 2009L, 2008L, 2010L, 2006L, 2007L,
2008L, 2011L, 2011L, 2008L, 2005L, 2008L, 2009L), dist.km = c(0,
42.4689368078209, 28.1123394088972, 36.8547005648639, 0,
49.7276501081775, 0, 30.1937156926235, 0, 0, 31.5643658415831,
0, 0, 0, 0, 0, 138.854136893762, 0, 79.4320981205195, 47.1692144550079
)), .Names = c("ST", "Rfips", "zip", "Year", "dist.km"), row.names = c(132931L,
105670L, 123332L, 21361L, 51576L, 3520L, 47367L, 99962L, 18289L,
126153L, 19321L, 83224L, 6041L, 46117L, 49294L, 48951L, 109350L,
64465L, 80164L, 22687L), class = "data.frame")
Here are the pieces of code I have tried.
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
state= aggregate(dist.km ~ ST, function(x) sum(dist.km==0, data=DDcomplete))
state = (DDcomplete[DDcomplete$dist.km==0,], .(ST), function(x) nrow(x))
Answer 0 (score: 5)
If you want to add the count as a column, you can do the following:
DDcomplete %>% group_by(ST) %>% mutate(count = sum(dist.km == 0))
Or, if you only want the count for each state:
DDcomplete %>% group_by(ST) %>% summarise(count = sum(dist.km == 0))
Actually, you were very close to the solution. Your code
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
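is almost correct. You can remove the DDcomplete$ from the call to sum, because within a dplyr chain you can refer to the columns directly. With DDcomplete$ in place, sum sees the whole ungrouped column, so every state gets the same overall count.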
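To make the difference concrete, here is a minimal sketch on a made-up data frame (toy and its values are invented for illustration; only the column names ST and dist.km come from the question):

```r
library(dplyr)

# Toy data, invented for illustration (not the asker's data)
toy <- data.frame(
  ST      = factor(c("PA", "PA", "VA", "VA", "NY")),
  dist.km = c(0, 0, 0, 42.5, 30.2)
)

# toy$dist.km ignores the grouping: each group gets the overall count of zeros
wrong <- toy %>%
  group_by(ST) %>%
  summarise(zero = sum(toy$dist.km == 0, na.rm = TRUE))

# The bare column name is evaluated separately within each group
right <- toy %>%
  group_by(ST) %>%
  summarise(zero = sum(dist.km == 0, na.rm = TRUE))

wrong$zero  # 3 3 3 (NY, PA, VA)
right$zero  # 0 2 1 (NY, PA, VA)
```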
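Also note that by using summarise, you condense the data frame to one row per group, containing only the grouping column(s) and whatever you computed inside summarise. If you just want to add a column with the counts, you can use mutate as I did at the top of this answer.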
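The same summarise pattern should extend to the distance-range columns mentioned in the question. This is only a sketch: the toy data and the cut points (0, 50) are assumptions chosen for illustration, not values from the question.

```r
library(dplyr)

# Toy data, invented for illustration (not the asker's data)
toy <- data.frame(
  ST      = factor(c("PA", "PA", "VA", "VA", "NY")),
  dist.km = c(0, 30.2, 0, 42.5, 138.9)
)

# One summary column per (hypothetical) distance range
ranges <- toy %>%
  group_by(ST) %>%
  summarise(
    zero    = sum(dist.km == 0),
    upto_50 = sum(dist.km > 0 & dist.km <= 50),
    over_50 = sum(dist.km > 50)
  )

ranges  # one row per state: NY, PA, VA
```

Each column is just the sum of a logical vector, so adding another range is one more line inside summarise.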
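If you are only interested in the states whose count is greater than zero, you could also use dplyr's count function together with filter to subset the data first: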
filter(DDcomplete, dist.km == 0) %>% count(ST)
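Note that filtering first drops any state with no zero distances, while the question asks for every state to appear. One possible way to keep them, assuming ST is a factor and a dplyr version that supports the .drop argument of group_by (0.8.0 or later), is sketched here on invented toy data:

```r
library(dplyr)

# Toy data, invented for illustration; "NY" is a factor level with no rows
toy <- data.frame(
  ST      = factor(c("PA", "PA", "VA"), levels = c("NY", "PA", "VA")),
  dist.km = c(0, 30.2, 42.5)
)

# .drop = FALSE keeps empty factor levels as groups; their sum is 0
all_states <- toy %>%
  group_by(ST, .drop = FALSE) %>%
  summarise(zero = sum(dist.km == 0))

all_states$zero  # 0 1 0 (NY, PA, VA)
```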
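Answer 1 (score: 4)
I hope I haven't missed anything, but it sounds like you just want a table after doing some subsetting (df here stands for the data frame from the question):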
table(df[df$dist.km == 0, "ST"])
#
# CT DE FL GA MA MD ME NC NH NJ NY PA RI SC VA VT WV
#  1  0  0  0  2  1  0  1  0  0  2  1  0  2  1  0  0
Other approaches could be:
## dplyr, since you seem to be using it
library(dplyr)
df %>%
filter(dist.km == 0) %>%
group_by(ST) %>%
summarise(n())
## aggregate, since you tried that too
aggregate(dist.km ~ ST, df, function(x) sum(x == 0))
## data.table
library(data.table)
as.data.table(df)[dist.km == 0, .N, by = ST]
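For the distance ranges mentioned in the question, the table approach can probably be combined with cut from base R. This is a sketch only: the toy data and the bin boundaries are invented, and it assumes distances are never negative.

```r
# Toy data, invented for illustration; "NY" is a factor level with no rows
toy <- data.frame(
  ST      = factor(c("PA", "PA", "VA"), levels = c("NY", "PA", "VA")),
  dist.km = c(0, 30.2, 138.9)
)

# Hypothetical bins: exactly 0, (0, 50], and above 50 km
bins <- cut(toy$dist.km,
            breaks = c(-Inf, 0, 50, Inf),
            labels = c("zero", "upto_50", "over_50"))

# Cross-tabulate state by bin; unused factor levels stay in as rows of zeros
counts <- table(toy$ST, bins)
counts
```

Because table keeps all factor levels, states without any matching rows still show up with a count of 0.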