Question

I'm trying to summarise gender counts by country and I'm not sure how to tame ddply here. I have this data frame:

df <- data.frame(country = c("Italy", "Germany", "Italy", "USA","Poland"),
                 gender = c("male", "female", "male", "female", "female"))

I'd like a data frame with one row per country, giving the number of males and females in each. However,

ddply(df,~country,table)

   country female male
1  Germany      1    0
2  Germany      0    0
3  Germany      0    0
4  Germany      0    0
5    Italy      0    0
6    Italy      0    2
7    Italy      0    0
8    Italy      0    0
9   Poland      0    0
10  Poland      0    0
11  Poland      1    0
12  Poland      0    0
13     USA      0    0
14     USA      0    0
15     USA      0    0
16     USA      1    0

Although it produces the expected counts, it also adds three extra rows for each group. Why?

Answer 1

I found this solution. Not sure it's the most elegant.

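library(plyr)

# Same data as in the question, but with a missing gender for Poland
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender = c("male", "female", "male", "female", NA))

# One row per country: count each gender, plus missing values
ddply(df, .(country), summarise,
      female = sum(gender == "female", na.rm = TRUE),
      male = sum(gender == "male", na.rm = TRUE),
      na = sum(is.na(gender)))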

Answer 2

It looks like you just want

as.data.frame.matrix(table(df))

Credit: How to convert a table to a data frame
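For what it's worth, here is a minimal sketch of what that one-liner returns, next to plain as.data.frame for contrast. It assumes the question's df; stringsAsFactors = TRUE is my addition so that the columns are factors as in the original post (R 4.0.0 and later no longer create factors by default), and wide and long are just illustrative names.

```r
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender  = c("male", "female", "male", "female", "female"),
                 stringsAsFactors = TRUE)  # assumption: factor columns, as in the question

# Wide: one row per country, one column per gender.
# The countries end up as row names, not as a column.
wide <- as.data.frame.matrix(table(df))

# Long: one row per country/gender combination, with the count in Freq.
long <- as.data.frame(table(df))
```

The wide form is the shape the question asks for; the long form also keeps the zero-count combinations, one per row.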

But to answer your question about why you got that output...

For a factor, table tabulates every factor level, not just the values actually present in the vector. So if you run

df[df$country=="Germany",]$country

[1] Germany
Levels: Germany Italy Poland USA

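you can see that, after subsetting, the country vector still has all four levels but only one value. Then, when you run table, it counts every level, even the ones that don't occur in the vector.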

table(df[df$country=="Germany",])

         gender
country   female male
  Germany      1    0
  Italy        0    0
  Poland       0    0
  USA          0    0

When debugging ddply, always try your function on one of the subsets it creates from the data.
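A rough way to follow that advice without plyr is to build the per-country pieces yourself with base split and call table on one of them. This is only a sketch: split approximates what ddply passes to the function, stringsAsFactors = TRUE is my addition so the columns are factors as in the question (R 4.0.0 and later no longer do this by default), and pieces and germany are illustrative names.

```r
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender  = c("male", "female", "male", "female", "female"),
                 stringsAsFactors = TRUE)  # assumption: factor columns, as in the question

# One data frame per country; each piece keeps all factor levels.
pieces  <- split(df, df$country)
germany <- pieces[["Germany"]]

# Tabulating the whole piece crosses every country level with every gender level.
dim(table(germany))              # 4 2

# Dropping unused levels first leaves only what is present in the piece.
dim(table(droplevels(germany)))  # 1 1

# Tabulating just the gender column gives one count per gender for this country.
table(germany$gender)
```

The last call is the per-group shape the question is after: a single set of gender counts for the country, with no extra rows for the other countries.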

Answer 3

Since you're already using plyr, why not use its count function?

> library(plyr)
> count(df)
#   country gender freq
# 1 Germany female    1
# 2   Italy   male    2
# 3  Poland female    1
# 4     USA female    1

Or, in base R, table:

> ( tb <- table(df) )
#          gender
# country   female male
#   Germany      1    0
#   Italy        0    2
#   Poland       1    0
#   USA          1    0

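ADDED: Per the OP's comment below, to convert the table above to a data frame you can unclass it, bind the row names on as a country column with cbind, and reset the row names.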

> as.data.frame(cbind(country = rownames(tb), unclass(tb)),
                row.names = "NULL")
#   country female male
# 1 Germany      1    0
# 2   Italy      0    2
# 3  Poland      1    0
# 4     USA      1    0
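
One caveat with the cbind approach, as far as I can tell: binding a character vector onto an integer matrix gives a character matrix, so the female and male columns do not come out as numbers. A sketch that keeps the counts as integers, assuming the same df as in the question (stringsAsFactors = TRUE is my addition, and wide is just an illustrative name):

```r
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender  = c("male", "female", "male", "female", "female"),
                 stringsAsFactors = TRUE)  # assumption: factor columns, as in the question
tb <- table(df)

wide <- as.data.frame.matrix(tb)               # counts stay integer
wide <- cbind(country = rownames(wide), wide)  # promote the row names to a column
rownames(wide) <- NULL                         # reset to default row names
```

This goes through a data frame before adding the country column, so the counts are never coerced to character.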

Using base::table as an argument to plyr::ddply
