I have a very large mixed dataset (character variables, numeric variables, factors) in which negative values often denote missing values (see Scale), but not always (see Profit):
  Country Ccode  Year Profit      Scale    ID Happiness_d Power_d  ID_d
  <chr>   <fct> <dbl>  <dbl> <labelled> <dbl>       <dbl>   <dbl> <dbl>
1 France  FR     2000   1000         NA     1      40000. 160000.  1.67
2 France  FR     2001  -1200          1     1      80000. 320000.  1.67
3 France  FR     2000   1400          0     2      40000. 160000.  1.67
4 France  FR     2001   1600          3     2      80000. 320000.  1.67
5 UK      UK     2000  -1000         -9     3      40000. 160000.  1.67
6 UK      UK     2001   1000          2     3      80000. 320000.  1.67
7 UK      UK     2000   1000          4     4      40000. 160000.  1.67
8 UK      UK     2001   1000          0     4      80000. 320000.  1.67
I would like to replace all negative values with NA:
df[df < 0] <- NA
The problem is that, although this is meant to remove the negative values that stand for NA in Scale, in the example dataset it would also remove the negative numbers in Profit, which are clearly not NA.
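To make the problem concrete, here is a minimal sketch on a hypothetical two-column extract of the example data (the data frame and the name blanket are made up for illustration). It shows that the blanket replacement treats both columns the same way:

```r
# Hypothetical two-column extract of the example data
df <- data.frame(Profit = c(1000, -1200, -1000),
                 Scale  = c(NA, 1, -9))

# Blanket replacement: every negative value in every column becomes NA
blanket <- df
blanket[blanket < 0] <- NA

print(blanket)
```

The -9 in Scale is replaced as intended, but so are the genuine losses -1200 and -1000 in Profit.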
I would therefore like to make the outcome depend on the range of the variable. The structure of the Scale variable is as follows:
Class 'labelled' atomic [1:135894] NA NA 2 NA NA NA NA NA NA NA ...
..- attr(*, "label")= chr "Do You Use Technology Licensed From A Foreign-Owned Company?"
..- attr(*, "format.stata")= chr "%24.0g"
..- attr(*, "labels")= Named num [1:3] -9 1 2
.. ..- attr(*, "names")= chr [1:3] "Don't Know (Spontaneous)" "Yes" "No"
> names(New_Comprehensive_June_25_2018$e6)
I have discovered that, using the haven library, the factor levels can be derived from
..- attr(*, "labels")= Named num [1:3] -9 1 2
using get_values():
get_values(df$Scale)
[1] -9 1 2
Is it possible to write a solution that eliminates only these negative values and not the others?
To be clear, the desired output would be:
  Country Ccode  Year Profit  Scale    ID Happiness_d Power_d  ID_d
  <chr>   <fct> <dbl>  <dbl>  <dbl> <dbl>       <dbl>   <dbl> <dbl>
1 France  FR     2000   1000     NA     1      40000. 160000.  1.67
2 France  FR     2001  -1200      1     1      80000. 320000.  1.67
3 France  FR     2000   1400      0     2      40000. 160000.  1.67
4 France  FR     2001   1600      3     2      80000. 320000.  1.67
5 UK      UK     2000  -1000 **NA**     3      40000. 160000.  1.67
6 UK      UK     2001   1000      2     3      80000. 320000.  1.67
7 UK      UK     2000   1000      4     4      40000. 160000.  1.67
8 UK      UK     2001   1000      0     4      80000. 320000.  1.67
dput sample (note that the variable Scale does not actually exist):
Answer 0 (score: 1)
Here is a simple example that you can apply to your dataset.
# example data
df <- data.frame(a = c("A", "A", "B"),
                 x = c(1, 2, 3),
                 y = c(NA, 3, -7),
                 z = c(200, 300, -400))
library(dplyr)
# for each numeric column whose minimum lies in [-9, -1], set negative values to NA
df %>% mutate_if(is.numeric, ~ ifelse(between(min(., na.rm = TRUE), -9, -1) & . < 0, NA, .))
#   a x  y    z
# 1 A 1 NA  200
# 2 A 2  3  300
# 3 B 3 NA -400
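A column is updated (mutate) only if it is numeric and its minimum value lies between -9 and -1. The update replaces the negative values in that column with NA.
This assumes you only have integer values. If not, you can use between(..., -9, 0).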
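As an alternative to the range rule above, the following is a minimal sketch that replaces only those negative values that are declared in a column's "labels" attribute, which is the attribute visible in the question's str() output. It reads the attribute with base R's attr() rather than get_values(). The helper name negative_labels_to_na and the toy data are hypothetical, and the sketch assumes the missing-value codes are exactly the negative entries of "labels".

```r
# Hypothetical helper: set to NA only the negative codes declared in the
# column's "labels" attribute; columns without that attribute are returned as-is.
negative_labels_to_na <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs) || !is.numeric(x)) return(x)
  missing_codes <- unname(labs[labs < 0])   # assumption: negative labels mean "missing"
  x[x %in% missing_codes] <- NA
  x
}

# Toy data (assumed layout): Scale carries a "labels" attribute, Profit does not
df <- data.frame(Profit = c(1000, -1200, 1400, -1000),
                 Scale  = c(NA, 1, 0, -9))
attr(df$Scale, "labels") <- c("Don't Know (Spontaneous)" = -9, "Yes" = 1, "No" = 2)

# Apply the helper to every column while keeping the data frame structure
df[] <- lapply(df, negative_labels_to_na)
print(df)
```

Because the decision is driven by the labels rather than by the column minimum, a labelled column that also contains legitimate negative values would keep them, and Profit is never touched.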
Answer 1 (score: 0)
Base R solution:
# Replace negative values with NA in every column except Country, Ccode and Profit
# (columns 1, 2 and 4), then bind the result back to those three columns.
cbind(df[, c(1, 2, 4)], do.call(cbind, lapply(df[, -c(1, 2, 4)], function(x) ifelse(x < 0, NA, x))))
Output:
  Country Ccode Profit Year Scale ID Happiness_d Power_d ID_d
1  France    FR   1000 2000    NA  1       40000  160000 1.67
2  France    FR  -1200 2001     1  1       80000  320000 1.67
3  France    FR   1400 2000     0  2       40000  160000 1.67
4  France    FR   1600 2001     3  2       80000  320000 1.67
5      UK    UK  -1000 2000    NA  3       40000  160000 1.67
6      UK    UK   1000 2001     2  3       80000  320000 1.67
7      UK    UK   1000 2000     4  4       40000  160000 1.67
8      UK    UK   1000 2001     0  4       80000  320000 1.67