R: Using summarise_all(funs(sum)) after spread returns 0, even with NAs removed

Asked: 2018-02-25 02:19:54

Tags: r dplyr

I am having a problem where the sum in

    spread(FUND, PTD_BALANCE, fill = 0) %>%
    summarise_all(funs(sum))

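incorrectly returns 0 for all values in some columns. This happens even when I allow NAs in the spread and remove them in the summarise. The spread produces 25 variables from the original 4 columns. Here is some of what I have tried, to no avail: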

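library(dplyr)  # select, group_by, mutate, row_number, summarise_all, funs, %>%
library(tidyr)  # unite, spread

# Read the first six columns as factors and PTD_BALANCE as double
Budget_FY11_FY18 <- read.csv("FY_8yr_Adopted_Fund_Clean.csv",
                             colClasses = c(rep("factor",6), "double"))

# Spread FUND into one column per fund, then sum within each FY_Month
MBudget_Mvar <- Budget_FY11_FY18 %>%
        select(BUDGET_NAME, PERIOD_NAME, FUND, PTD_BALANCE) %>%
        unite("FY_Month", BUDGET_NAME, PERIOD_NAME, remove = TRUE) %>%
        group_by(FY_Month) %>%
        mutate(i = row_number()) %>%
        spread(FUND, PTD_BALANCE, fill = 0) %>%
        summarise_all(funs(sum))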

The dput of the head of Budget_FY11_FY18 is (with some labels removed):

> dput(head(Budget_FY11_FY18))

structure(list(BUDGET_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("FY11 ADOPTED", "FY12 ADOPTED", "FY13 ADOPTED", "FY14 ADOPTED",
"FY15 ADOPTED", "FY16 ADOPTED", "FY17 ADOPTED", "FY18 ADOPTED"),
class = "factor"), PERIOD_NUM = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("1", "10", "11", "12", "2", "3", "4", "5", "6", "7", "8", "9"),
class = "factor"), FUND = structure(c(6L, 6L, 6L, 6L, 6L, 6L),
.Label = c(), class = "factor"), SERVICE_CENTER = structure(c(223L,
223L, 223L, 223L, 223L, 223L), .Label = c(), class = "factor"),
    ACCOUNT = structure(c(3L, 5L, 359L, 202L, 203L, 371L), .Label = c(),
    class = "factor"), PERIOD_NAME = structure(c(6L, 6L, 6L, 6L, 6L, 6L),
    .Label = c("April", "August", "December", "February", "January",
    "July", "June", "March", "May", "November", "October", "September"),
    class = "factor"), PTD_BALANCE = c(-21895250, -650000, -435042,
    -4300000, -322908, -513417)), .Names = c("BUDGET_NAME",
"PERIOD_NUM", "FUND", "SERVICE_CENTER", "ACCOUNT", "PERIOD_NAME",
"PTD_BALANCE"), row.names = c(NA, 6L), class = "data.frame")

I have also tried reading the non-numeric columns in as character, which produces the following dput:

> dput(head(Budget_FY11_FY18))

structure(list(BUDGET_NAME = c("FY11 ADOPTED", "FY11 ADOPTED", 
"FY11 ADOPTED", "FY11 ADOPTED", "FY11 ADOPTED", "FY11 ADOPTED"
), PERIOD_NUM = c("1", "1", "1", "1", "1", "1"), FUND = c("General Fund", 
"General Fund", "General Fund", "General Fund", "General Fund", 
"General Fund"), SERVICE_CENTER = c("Unallocated", "Unallocated", 
"Unallocated", "Unallocated", "Unallocated", "Unallocated"), 
    ACCOUNT = c("Ad Valorem Tax - Current", "Ad Valorem Tax Prior", 
    "PILOT's", "In Lieu Of Taxes-Utils", "In Lieu Of Taxes-Sewer", 
    "Property Taxes Interest & Penalty"), PERIOD_NAME = c("July", 
    "July", "July", "July", "July", "July"), PTD_BALANCE = c(-21895250, 
    -650000, -435042, -4300000, -322908, -513417)), .Names = c("BUDGET_NAME", 
"PERIOD_NUM", "FUND", "SERVICE_CENTER", "ACCOUNT", "PERIOD_NAME", 
"PTD_BALANCE"), row.names = c(NA, 6L), class = "data.frame")

I currently have the following packages loaded:

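I have tried various approaches to isolate the problem.

Additional context: I am trying to spread and sum a dataset of ~420k observations to prepare it for analysis as a multivariate time series. The data is of class numeric and ranges from 54 million to -2 million. The sign changes because the dataset represents a budget.

Any help is greatly appreciated!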

1 Answer:

Answer 0 (score: 0)

I initially believed the problem was similar to those described in the previously answered questions here and here.

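Although akrun and Tung correctly pointed out the "Error in as.character.factor(x) : malformed factor" errors, further review of my raw data showed that the original code actually returned the correct values, though the code as written could evidently cause problems elsewhere.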
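A minimal sketch of how that error can arise, assuming it comes from the dput in the question, where some labels were removed and the factors were left with `.Label = c()`. The vectors `bad` and `good` are hypothetical stand-ins, not the real columns. A factor whose integer codes have no matching levels cannot be converted to character, and steps such as unite() have to make that conversion.

```r
# Hypothetical factor rebuilt from a dput() whose labels were stripped:
# the integer codes are present, but there are no levels to map them to.
bad <- structure(c(6L, 6L, 6L), .Label = c(), class = "factor")

# Capture the error message instead of stopping the script.
msg <- tryCatch(as.character(bad), error = function(e) conditionMessage(e))
print(msg)

# The same construction with its labels intact converts cleanly.
good <- structure(c(1L, 1L, 2L),
                  .Label = c("General Fund", "Sewer Fund"),
                  class = "factor")
print(as.character(good))
```

If this is the cause, the error belongs to the abbreviated sample data rather than to the full data set read from the CSV, which would be consistent with the original code returning correct values.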

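For my purposes, the flaw in the approach is one of linear algebra and model selection, and it occurs downstream. The matrix produced by the operations described in the question is completely singular.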
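One way to test a reshaped matrix for this kind of problem is to compare its numerical rank with its number of columns. The sketch below uses a small hypothetical matrix `X` in which one column is the sum of two others. It only illustrates the check and says nothing about why the real budget matrix was singular.

```r
# Hypothetical wide matrix: column C equals A + B, so the columns are
# linearly dependent and the matrix is rank-deficient.
X <- cbind(A = c(1, 2, 3, 4),
           B = c(0, 1, 0, 1),
           C = c(1, 3, 3, 5))

# qr() reports the numerical rank; compare it with the column count.
rank_X <- qr(X)$rank
is_rank_deficient <- rank_X < ncol(X)

print(rank_X)
print(is_rank_deficient)
```

When the rank is below the column count, the cross-product matrix used in least-squares fitting cannot be inverted, which is the kind of downstream failure described above.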

My belief that the problem stemmed from the reshaping and summarising was incorrect.
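A minimal sketch of one way to confirm that the reshape and summarise step gives the right totals: compute the same sums directly with group_by() and compare. The data frame `toy` is hypothetical and only borrows the column names from the question. `summarise_all(sum)` stands in for `summarise_all(funs(sum))` because funs() is deprecated in newer dplyr releases.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data using the question's column names.
toy <- data.frame(
  BUDGET_NAME = c("FY11 ADOPTED", "FY11 ADOPTED", "FY11 ADOPTED", "FY12 ADOPTED"),
  PERIOD_NAME = c("July", "July", "July", "July"),
  FUND        = c("General Fund", "General Fund", "Sewer Fund", "Sewer Fund"),
  PTD_BALANCE = c(-100, 40, 25, -5),
  stringsAsFactors = FALSE
)

# The pipeline from the question: spread to wide, then sum per FY_Month.
wide_sums <- toy %>%
  unite("FY_Month", BUDGET_NAME, PERIOD_NAME, remove = TRUE) %>%
  group_by(FY_Month) %>%
  mutate(i = row_number()) %>%
  spread(FUND, PTD_BALANCE, fill = 0) %>%
  summarise_all(sum)

# Reference totals computed directly, without reshaping.
direct_sums <- toy %>%
  unite("FY_Month", BUDGET_NAME, PERIOD_NAME, remove = TRUE) %>%
  group_by(FY_Month, FUND) %>%
  summarise(total = sum(PTD_BALANCE)) %>%
  ungroup()

# Put the wide result back in long form and line the two up.
wide_long <- wide_sums %>%
  select(-i) %>%
  gather(FUND, total_wide, -FY_Month)

check <- left_join(direct_sums, wide_long, by = c("FY_Month", "FUND"))
sums_match <- isTRUE(all.equal(check$total, check$total_wide))
print(sums_match)
```

A column that is zero in the wide result for a given FY_Month simply means that fund has no rows in that month, since `fill = 0` supplies the missing cells.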

Any further discussion would probably be best placed on Cross Validated, or reframed as a question about how malformed factors arise.

If it is determined that this question and its response/"answer" have no value to the community, it should be deleted.