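I have run into a problem where the sum in
spread(FUND, PTD_BALANCE, fill = 0) %>%
summarise_all(funs(sum))
incorrectly returns 0 for every value in some columns. This happens even when I allow NAs in the spread and remove them in the summarise. The spread produces 25 variables from the original 4 columns. Here is some of what I have tried, to no avail: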
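library(dplyr)  # select(), group_by(), mutate(), row_number(), summarise_all(), funs()
library(tidyr)  # unite(), spread()

# Read the first six columns as factors and PTD_BALANCE as double
Budget_FY11_FY18 <- read.csv("FY_8yr_Adopted_Fund_Clean.csv",
                             colClasses = c(rep("factor", 6), "double"))

# Build one FY_Month key, spread FUND into columns, then sum within each FY_Month
MBudget_Mvar <- Budget_FY11_FY18 %>%
  select(BUDGET_NAME, PERIOD_NAME, FUND, PTD_BALANCE) %>%
  unite("FY_Month", BUDGET_NAME, PERIOD_NAME, remove = TRUE) %>%
  group_by(FY_Month) %>%
  mutate(i = row_number()) %>%
  spread(FUND, PTD_BALANCE, fill = 0) %>%
  summarise_all(funs(sum))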
The dput of the head of Budget_FY11_FY18 is (with some labels removed):
dput(head(Budget_FY11_FY18))
structure(list(BUDGET_NAME = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("FY11 ADOPTED", "FY12 ADOPTED", "FY13 ADOPTED",
"FY14 ADOPTED", "FY15 ADOPTED", "FY16 ADOPTED", "FY17 ADOPTED",
"FY18 ADOPTED"), class = "factor"), PERIOD_NUM = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("1", "10", "11", "12", "2", "3",
"4", "5", "6", "7", "8", "9"), class = "factor"), FUND = structure(c(6L,
6L, 6L, 6L, 6L, 6L), .Label = c(), class = "factor"),
SERVICE_CENTER = structure(c(223L, 223L, 223L, 223L, 223L,
223L), .Label = c(), class = "factor"), ACCOUNT = structure(c(3L,
5L, 359L, 202L, 203L, 371L), .Label = c(), class = "factor"),
PERIOD_NAME = structure(c(6L, 6L, 6L, 6L, 6L, 6L), .Label = c("April",
"August", "December", "February", "January", "July", "June",
"March", "May", "November", "October", "September"), class = "factor"),
PTD_BALANCE = c(-21895250, -650000, -435042, -4300000, -322908,
-513417)), .Names = c("BUDGET_NAME", "PERIOD_NUM", "FUND",
"SERVICE_CENTER", "ACCOUNT", "PERIOD_NAME", "PTD_BALANCE"), row.names = c(NA,
6L), class = "data.frame")
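Several packages are currently loaded in my session. I also tried reading the non-numeric columns in as character, which produces the following dput: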
> dput(head(Budget_FY11_FY18))
structure(list(BUDGET_NAME = c("FY11 ADOPTED", "FY11 ADOPTED",
"FY11 ADOPTED", "FY11 ADOPTED", "FY11 ADOPTED", "FY11 ADOPTED"
), PERIOD_NUM = c("1", "1", "1", "1", "1", "1"), FUND = c("General Fund",
"General Fund", "General Fund", "General Fund", "General Fund",
"General Fund"), SERVICE_CENTER = c("Unallocated", "Unallocated",
"Unallocated", "Unallocated", "Unallocated", "Unallocated"),
ACCOUNT = c("Ad Valorem Tax - Current", "Ad Valorem Tax Prior",
"PILOT's", "In Lieu Of Taxes-Utils", "In Lieu Of Taxes-Sewer",
"Property Taxes Interest & Penalty"), PERIOD_NAME = c("July",
"July", "July", "July", "July", "July"), PTD_BALANCE = c(-21895250,
-650000, -435042, -4300000, -322908, -513417)), .Names = c("BUDGET_NAME",
"PERIOD_NUM", "FUND", "SERVICE_CENTER", "ACCOUNT", "PERIOD_NAME",
"PTD_BALANCE"), row.names = c(NA, 6L), class = "data.frame")
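I have tried various ways of isolating the problem.
Other background: I am trying to spread and sum a dataset of about 420k observations to prepare it for analysis as a multivariate time series. The data are of class numeric and range from 54 million to -2 million. The sign changes because the dataset represents a budget.
Any help is much appreciated!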
Answer 0 (score: 0)
I initially thought this problem was similar to the ones described in the previously answered questions here and here.
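Although akrun and Tung correctly pointed out the "Error in as.character.factor(x) : malformed factor" errors, a further review of my raw data showed that the original code actually returned the correct values, though the code as written could evidently cause problems elsewhere.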
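One way to confirm that the reshape-then-sum step returns correct totals is to compare it against an independent aggregation. The following is only a minimal sketch on made-up data: the toy values and the "Sewer Fund" name are hypothetical, summarise_all(sum) stands in for summarise_all(funs(sum)), and the helper column i is dropped before summing so that only fund columns are compared.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column roles as in the question
toy <- data.frame(
  FY_Month    = c("FY11_July", "FY11_July", "FY11_July", "FY11_August", "FY11_August"),
  FUND        = c("General Fund", "General Fund", "Sewer Fund", "General Fund", "Sewer Fund"),
  PTD_BALANCE = c(-100, 40, 25, -10, 5),
  stringsAsFactors = FALSE
)

# Reshape-then-sum, following the approach in the question
wide <- toy %>%
  group_by(FY_Month) %>%
  mutate(i = row_number()) %>%
  spread(FUND, PTD_BALANCE, fill = 0) %>%
  select(-i) %>%
  summarise_all(sum)

# Independent cross-check in base R: sum PTD_BALANCE per month and fund
check <- tapply(toy$PTD_BALANCE, list(toy$FY_Month, toy$FUND), sum)

# Put the pipeline result into the same matrix layout and compare
wide_mat <- as.matrix(as.data.frame(wide)[, colnames(check)])
rownames(wide_mat) <- wide$FY_Month
agree <- isTRUE(all.equal(wide_mat[rownames(check), colnames(check)],
                          check, check.attributes = FALSE))
print(agree)
```

If the two results agree on the real data as well, columns that sum to 0 reflect the data rather than the reshape.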
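For my purposes, the flaw in the approach was one of linear algebra and model selection, and it occurred downstream. The matrix produced by the operations described in the question is exactly singular.
I was wrong to think that the problem stemmed from the reshape and the summarise.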
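As an illustration of what a singular matrix looks like in practice, here is a small base R sketch that checks for rank deficiency with qr(). The matrix X is made up for the example (column b is twice column a), and the answer does not say how the singularity was actually detected, so this is an assumption about one possible check.

```r
# Hypothetical matrix: column b is exactly 2 * column a, so the columns are dependent
X <- cbind(a = c(1, 2, 3, 4),
           b = c(2, 4, 6, 8),
           c = c(1, 0, 1, 0))

# Numerical rank from the QR decomposition
rank_X <- qr(X)$rank

# A rank below the number of columns means X'X cannot be inverted
is_rank_deficient <- rank_X < ncol(X)
print(c(rank = rank_X, columns = ncol(X)))
print(is_rank_deficient)
```

In a wide budget matrix, columns that are all zero or that duplicate one another would cause the same kind of deficiency.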
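Any further discussion would probably be best placed on Cross Validated, or reframed as a question about how the malformed factor arises.
If it is determined that this question and its response/"answer" have no value to the community, it should be deleted.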
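As a possible starting point for such a reframed question, the sketch below rebuilds a factor the way the posted dput does, with the labels stripped to .Label = c(). That the stripped labels are what triggers the error is an assumption, not something confirmed above.

```r
# Same shape as the FUND column in the posted dput: integer codes but no labels
f <- structure(c(6L, 6L, 6L), .Label = c(), class = "factor")

# Converting it to character fails because the levels are missing
res <- tryCatch(as.character(f), error = function(e) e)
if (inherits(res, "error")) print(conditionMessage(res))

# A well-formed factor carries a character vector of levels
g <- factor(c("General Fund", "General Fund"))
print(as.character(g))
```

If this is the cause, the error would come from the abbreviated dput rather than from the data read by read.csv.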