How to group, pivot, and summarize in R

Time: 2018-01-28 04:07:02

Tags: r

I am using the GermanCredit dataset from the caret package.

library("caret")
data(GermanCredit)

I filtered it down a bit:

credit.all <- GermanCredit[,c(10, 1:9, 11:13, 16:19)]
attach(credit.all)
names(credit.all)

That leaves these column names:

 [1] "Class"                          "Duration"                      
 [3] "Amount"                         "InstallmentRatePercentage"     
 [5] "ResidenceDuration"              "Age"                           
 [7] "NumberExistingCredits"          "NumberPeopleMaintenance"       
 [9] "Telephone"                      "ForeignWorker"                 
[11] "CheckingAccountStatus.lt.0"     "CheckingAccountStatus.0.to.200"
[13] "CheckingAccountStatus.gt.200"   "CreditHistory.ThisBank.AllPaid"
[15] "CreditHistory.PaidDuly"         "CreditHistory.Delay"           
[17] "CreditHistory.Critical"  

What I need to do is summarize two of these columns. I know how to do this kind of thing in SQL:

SELECT
  Class
, SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) AS Critical
, SUM(CASE WHEN `CreditHistory.Critical` = 0 THEN 1 ELSE 0 END) AS NotCritical
, SUM(CASE WHEN `CreditHistory.Critical` = 1 THEN 1 ELSE 0 END) / COUNT(*) AS PctCritical
FROM `credit.all`
GROUP BY
  Class

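This would produce something like this: [image: query result]

However, I am trying to find my footing in R, and from books and Google it seems I should be using melt and dcast from reshape2 to accomplish this. What I have tried is basically a variation of this: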

library(reshape2)
credit.melted <- melt(credit.all[,c(1,17)], ID=c("name", "Class"))
dcast(credit.melted, Class~CreditHistory.Critical, nrow, fill=0)

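But every attempt I make with these functions produces errors that are too cryptic, and too common, for me to work out what I am doing wrong.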

Error in vapply(indices, fun, .default) : values must be length 1,
 but FUN(X[[1]]) result is length 0

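Sometimes my random permutations of the function calls produce slightly different error output, but nothing that points me in the right direction.

Question: How do I perform a pivoted summary in R, similar to the SQL result?

1 answer:

Answer 0: (score: 2)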

I don't think this is a pivot. You aren't using a pivot command in your SQL attempt, either. You can use dplyr to take exactly the same approach as your SQL:

library(dplyr)
credit.all %>%
  group_by(Class) %>%
  summarize(Critical = sum(CreditHistory.Critical == 1),
            NotCritical = sum(CreditHistory.Critical == 0),
            PctCritical = mean(CreditHistory.Critical == 1))
# # A tibble: 2 x 4
#   Class Critical NotCritical PctCritical
#   <fct>    <int>       <int>       <dbl>
# 1 Bad         50         250       0.167
# 2 Good       243         457       0.347

Because it is a binary column, the == 1 isn't necessary, but I left it in because (a) it is more similar to your SQL code, and (b) if the column held other values and you only wanted to count the 1s, this would be the way to do it. However, you can get the same result more simply:

credit.all %>%
  group_by(Class) %>%
  summarize(Critical = sum(CreditHistory.Critical),
            NotCritical = n() - Critical,
            PctCritical = Critical / n())

If you really want a pivot, we can go that route, though it seems less straightforward. Your data is already in long format, so we don't need melt; we can cast directly:

pivot = dcast(Class ~ CreditHistory.Critical, data = credit.all)
pivot
# Using CreditHistory.Critical as value column: use value.var to override.
# Aggregation function missing: defaulting to length
#   Class   0   1
# 1   Bad 250  50
# 2  Good 457 243

Then you can rename the columns and calculate the percentage:

names(pivot)[2:3] = c("NotCritical", "Critical")
pivot$PctCritical = with(pivot, Critical / (Critical + NotCritical))

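For comparison, the same grouped counts can also be produced in base R without any extra packages. The following is only a sketch: the data frame toy and its values are made up for illustration and stand in for credit.all, so the numbers are not the GermanCredit results. It cross-tabulates Class against the 0/1 indicator with table() and then builds the three summary columns from that table.

```r
# Toy stand-in for credit.all (hypothetical values, not the real GermanCredit data)
toy <- data.frame(
  Class = factor(c("Bad", "Bad", "Bad", "Good", "Good", "Good", "Good")),
  CreditHistory.Critical = c(1, 0, 0, 1, 1, 0, 0)
)

# Cross-tabulate Class (rows) against the 0/1 indicator (columns "0" and "1")
counts <- table(toy$Class, toy$CreditHistory.Critical)

# Assemble the same three summary columns as the SQL query
result <- data.frame(
  Class       = rownames(counts),
  Critical    = as.vector(counts[, "1"]),
  NotCritical = as.vector(counts[, "0"]),
  PctCritical = as.vector(counts[, "1"] / rowSums(counts))
)
result
```

This relies on the indicator actually containing both 0 and 1; if one of the two values never occurs, the corresponding column is missing from the table and the lookup by name would fail, which is one reason the dplyr version above is the more robust choice.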