更新:发表评论(谢谢!)我现在通过将比率从'double'转换为'character'来成功合并这些数据集。我仍然有兴趣知道为什么R在合并舍入的'double'变量时遇到问题(使用merge
和sqldf
时也会出现问题,并且不限于dplyr)。
我在使用dplyr的left_join和'double'变量合并数据帧方面遇到了麻烦。我试图根据目标与预测变量之比的分组来连接小数据表(只有20行)确定这些比率在培训,验证和测试集结果的5%范围内的比例(来自h2o生成的模型)。
要生成比率组,我只需将它们四舍五入到最接近的一位小数(因此比率变量保持为'double'类型,就像原始计算值一样)。但是,当我尝试合并这些舍入值时,随机行的NA出现在合并的数据帧中。
下面是代码和生成的表的示例 - 最终表有一行舍入为0.7的比率的缺失值,但所有表都存在于原始表中。
这是一个可能的错误吗?
(我知道我可以在dplyr中更有效地进行计算,但由于目前没有合并,所以整理起来似乎没什么意义。)
trainR <- mutate(trainR, ratio_GBM = round((ZZ_TOP_PRICE_R/pred_GBM),digits=1))
> train_pcs_GBM <- count(trainR, ratio_GBM)
> train_pcs_GBM <- rename(train_pcs_GBM, train_count_GBM = n)
> train_pcs_GBM <- mutate(train_pcs_GBM, train_pc_GBM = round((train_count_GBM/length(trainR$ratio_GBM)*100),1))
> train_pcs_GBM
# A tibble: 32 x 3
ratio_GBM train_count_GBM train_pc_GBM
<dbl> <int> <dbl>
1 0.0 1 0.0
2 0.1 15 0.0
3 0.2 302 0.1
4 0.3 440 0.1
5 0.4 1575 0.3
6 0.5 23304 4.6
7 0.6 62381 12.4
8 0.7 71098 14.1
9 0.8 65792 13.0
10 0.9 66991 13.3
# ... with 22 more rows
> #View(train_pcs_GBM)
>
> validR <- mutate(validR, ratio_GBM = round((ZZ_TOP_PRICE_R/pred_GBM),digits=1))
> valid_pcs_GBM <- count(validR, ratio_GBM)
> valid_pcs_GBM <- rename(valid_pcs_GBM, valid_count_GBM = n)
> valid_pcs_GBM <- mutate(valid_pcs_GBM, valid_pc_GBM = round((valid_count_GBM/length(validR$ratio_GBM)*100),1))
> valid_pcs_GBM
# A tibble: 34 x 3
ratio_GBM valid_count_GBM valid_pc_GBM
<dbl> <int> <dbl>
1 0.1 9 0.0
2 0.2 148 0.1
3 0.3 168 0.1
4 0.4 688 0.4
5 0.5 8058 4.8
6 0.6 20980 12.5
7 0.7 23446 14.0
8 0.8 21932 13.1
9 0.9 22030 13.1
10 1.0 19342 11.5
# ... with 24 more rows
> View(valid_pcs_GBM)
>
> testR <- mutate(testR, ratio_GBM = round((ZZ_TOP_PRICE_R/pred_GBM),digits=1))
> test_pcs_GBM <- count(testR, ratio_GBM)
> test_pcs_GBM <- rename(test_pcs_GBM, test_count_GBM = n)
> test_pcs_GBM <- mutate(test_pcs_GBM, test_pc_GBM = round((test_count_GBM/length(testR$ratio_GBM)*100),1))
> test_pcs_GBM
# A tibble: 31 x 3
ratio_GBM test_count_GBM test_pc_GBM
<dbl> <int> <dbl>
1 0.1 10 0.0
2 0.2 148 0.1
3 0.3 154 0.1
4 0.4 600 0.4
5 0.5 8359 5.0
6 0.6 20739 12.4
7 0.7 23283 13.9
8 0.8 21899 13.1
9 0.9 22216 13.2
10 1.0 19202 11.4
# ... with 21 more rows
>
> # Merge table of percentages ******************************************************************************
> res_GBM <- data.frame(ratio = numeric(20))
> res_GBM$ratio <- seq(0.1,2,0.1)
> res_GBM <- left_join(res_GBM, train_pcs_GBM, by = c("ratio" = "ratio_GBM"))
> res_GBM <- left_join(res_GBM, valid_pcs_GBM, by = c("ratio" = "ratio_GBM"))
> res_GBM <- left_join(res_GBM, test_pcs_GBM, by = c("ratio" = "ratio_GBM"))
> res_GBM
ratio train_count_GBM train_pc_GBM valid_count_GBM valid_pc_GBM test_count_GBM test_pc_GBM
1 0.1 15 0.0 9 0.0 10 0.0
2 0.2 302 0.1 148 0.1 148 0.1
3 0.3 NA NA NA NA NA NA
4 0.4 1575 0.3 688 0.4 600 0.4
5 0.5 23304 4.6 8058 4.8 8359 5.0
6 0.6 62381 12.4 20980 12.5 20739 12.4
7 0.7 NA NA NA NA NA NA
8 0.8 65792 13.0 21932 13.1 21899 13.1
9 0.9 66991 13.3 22030 13.1 22216 13.2
10 1.0 58921 11.7 19342 11.5 19202 11.4
11 1.1 47096 9.3 15071 9.0 15086 9.0
12 1.2 NA NA NA NA NA NA
13 1.3 NA NA NA NA NA NA
14 1.4 NA NA NA NA NA NA
15 1.5 NA NA NA NA NA NA
16 1.6 6761 1.3 2372 1.4 2359 1.4
17 1.7 NA NA NA NA NA NA
18 1.8 NA NA NA NA NA NA
19 1.9 NA NA NA NA NA NA
20 2.0 1095 0.2 482 0.3 435 0.3
>
with(DL_FC_res, lineplot(ratio, test_pc, train_pc, valid_pc, FC_pc))