I have a panel data set with 10 variables for 60 countries over 18 years (2000-2017), and a lot of the data is missing.
Country Year Broadband
Albania 2000 NA
Albania 2001 NA
Albania 2002 NA
Albania 2003 NA
Albania 2004 NA
Albania 2005 272
Albania 2006 NA
Albania 2007 10000
Albania 2008 64000
Albania 2009 92000
Albania 2010 105539
Albania 2011 128210
Albania 2012 160088
Albania 2013 182556
Albania 2014 207931
Albania 2015 242870
Albania 2016 263874
Albania 2017 NA
Algeria 2000 NA
Algeria 2001 NA
Algeria 2002 NA
Algeria 2003 18000
Algeria 2004 36000
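I want to use the na.approx function in R to interpolate (and extrapolate with rule = 2), but only within each country. For example, in this sample data set I want to interpolate the value for Albania 2006 and extrapolate the values for Albania 2000-2004 and 2017. But I want to make sure that the Albania 2017 value is not interpolated from Albania 2016 and Algeria 2003. For Algeria 2000-2002, I want the values to be extrapolated from the Algeria 2003 and 2004 data. I tried the following code: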
data <- group_by(data, country)
data$broadband <- na.approx(data$broadband, maxgap = Inf, rule = 2)
data <- as.data.frame(data)
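and tried different values of maxgap, but none of them solved my problem. I assumed that using the group_by function would make this work, but it does not. Does anyone know of a solution?
Edit: the only way I have found to do what I want is to split the data set into a separate data set for each unique country, using the following code: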
mylist <- split(data, data$country)
alb <- mylist[1]
alb <- as_data_frame(alb)
alg <- mylist[2]
alg <- as_data_frame(alg)
ang <- mylist[3]
ang <- as_data_frame(ang)
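and then use the na.approx function on the separate data sets one at a time.
Edit 2:
I have tried the solution suggested by Markus below, but it does not seem to work. These are the results for Angola using the code you suggested: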
Country Year Broadband Broadband_imp
Algeria 2014 1599692 1599692
Algeria 2015 2269348 2269348
Algeria 2016 2858906 2858906
Angola 2000 NA 2451556.286
Angola 2001 NA 2044206.571
Angola 2002 NA 1636856.857
Angola 2003 NA 1229507.143
Angola 2004 NA 822157.429
Angola 2005 NA 414807.714
Angola 2006 7458 7458
Angola 2007 11700 11700
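As you can see, the imputed values for Angola 2000-2005 appear to be calculated from Algeria's values, because they are far higher than Angola's 2006 value of 7458.
Edit 3: here is the full code I used: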
data <- read_excel("~/Documents/data.xlsx")
> dput(head(data))
structure(list(continent = c("Europe", "Europe", "Europe", "Europe",
"Europe", "Europe"), country = c("Albania", "Albania", "Albania",
"Albania", "Albania", "Albania"), Year = c(2000, 2001, 2002,
2003, 2004, 2005), `Individuals Using Internet, %, WB` = c(0.114097347,
0.325798377, 0.390081273, 0.971900415, 2.420387798, 6.043890864
), `Secure Internet Servers, WB` = c(NA, 1, NA, 1, 2, 1), `Mobile Cellular
Subscriptions, WB` = c(29791,
392650, 851000, 1100000, 1259590, 1530244), `Fixed Broadband Subscriptions,
WB` = c(NA,
NA, NA, NA, NA, 272), `Trade, % GDP, WB` = c(55.9204287230026,
57.4303612453301, 63.9342407411882, 65.4406219482911, 66.3578254370479,
70.2953012017195), `Air transport, freight (million ton-km)` = c(0.003,
0.003, 0.144, 0.088, 0.099, 0.1), `Air Transport, registered carrier
departures worldwide, WB` = c(3885,
3974, 3762, 3800, 4104, 4309), `FDI, net, inflows, % GDP, WB` =
c(3.93717707227928,
5.10495722596557, 3.04391445388559, 3.09793068135411, 4.66563777108359,
3.21722676118428), `Number of Airports, WFB` = c(10, 11, 11,
11, 11, 11), `Currently under EU Arms Sanctions` = c(0, 0, 0,
0, 0, 0), `Currently under EU Economic Sanctions` = c(0, 0, 0,
0, 0, 0), `Currently under UN Arms Sanctions` = c(0, 0, 0, 0,
0, 0), `Currently under UN Economic Sanctions` = c(0, 0, 0, 0,
0, 0), `Currently under US Arms Embargo` = c(0, 0, 0, 0, 0, 0
), `Currently under US Economic Sanctions` = c(0, 0, 0, 0, 0,
0)), .Names = c("continent", "country", "Year", "Individuals Using Internet,
%, WB",
"Secure Internet Servers, WB", "Mobile Cellular Subscriptions, WB",
"Fixed Broadband Subscriptions, WB", "Trade, % GDP, WB", "Air transport,
freight (million ton-km)",
"Air Transport, registered carrier departures worldwide, WB",
"FDI, net, inflows, % GDP, WB", "Number of Airports, WFB", "Currently under EU
Arms Sanctions",
"Currently under EU Economic Sanctions", "Currently under UN Arms Sanctions",
"Currently under UN Economic Sanctions", "Currently under US Arms Embargo",
"Currently under US Economic Sanctions"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
data_imputed <- data %>%
group_by(country) %>%
mutate(broadband_imp = na.approx(broadband, maxgap=Inf, rule = 2))
Answer 0 (score: 1)
You can use group_by and mutate:
library(tidyverse)
library(zoo)
df_imputed <- df %>%
group_by(Country) %>%
mutate(Broadband_imputed = na.approx(Broadband, maxgap = Inf, rule = 2))
which gives
> head(df_imputed)
# A tibble: 6 x 4
# Groups: Country [1]
Country Year Broadband Broadband_imputed
<fctr> <int> <int> <dbl>
1 Albania 2000 NA 272
2 Albania 2001 NA 272
3 Albania 2002 NA 272
4 Albania 2003 NA 272
5 Albania 2004 NA 272
6 Albania 2005 272 272
and
> df_imputed %>% filter(Country == 'Algeria')
# A tibble: 5 x 4
# Groups: Country [1]
Country Year Broadband Broadband_imputed
<fctr> <int> <int> <dbl>
1 Algeria 2000 NA 18000
2 Algeria 2001 NA 18000
3 Algeria 2002 NA 18000
4 Algeria 2003 18000 18000
5 Algeria 2004 36000 36000
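As for why the attempt in the question did not respect the groups: `data$broadband` extracts the whole column as a plain vector, so `na.approx` never sees the grouping set by `group_by`. The sketch below illustrates the difference on a made-up data frame; `toy` and its values are hypothetical and only the call pattern matters.

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: two countries, each with leading or trailing NAs.
toy <- data.frame(
  country   = rep(c("A", "B"), each = 4),
  year      = rep(2000:2003, times = 2),
  broadband = c(NA, 10, NA, 30, NA, 1000, 2000, NA)
)

# Ungrouped: `$` returns the full column, so the gap at the start of
# country B is interpolated between A's last value and B's first value.
ungrouped <- na.approx(toy$broadband, maxgap = Inf, rule = 2)

# Grouped: mutate() evaluates na.approx() separately for each country.
grouped <- toy %>%
  group_by(country) %>%
  mutate(broadband_imp = na.approx(broadband, maxgap = Inf, rule = 2)) %>%
  ungroup()

ungrouped[5]              # mixes countries A and B
grouped$broadband_imp[5]  # uses country B only
```

Note that `rule = 2` fills leading and trailing NAs with the nearest observed value, which is why the output above shows 272 for Albania 2000-2004 rather than a linear trend.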
Data
df <- read.table(text = "Country Year Broadband
Albania 2000 NA
Albania 2001 NA
Albania 2002 NA
Albania 2003 NA
Albania 2004 NA
Albania 2005 272
Albania 2006 NA
Albania 2007 10000
Albania 2008 64000
Albania 2009 92000
Albania 2010 105539
Albania 2011 128210
Albania 2012 160088
Albania 2013 182556
Albania 2014 207931
Albania 2015 242870
Albania 2016 263874
Albania 2017 NA
Algeria 2000 NA
Algeria 2001 NA
Algeria 2002 NA
Algeria 2003 18000
Algeria 2004 36000", header = TRUE)
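If the grouping still appears to be ignored, as in Edit 2 of the question, one possible cause (an assumption, the question does not confirm it) is that a `mutate` from another attached package, such as plyr, masks dplyr's version. A base R sketch with `ave` avoids that dependency altogether. The data frame `sub` below is a small hand-picked subset of the rows above, and `na.rm = FALSE` is added so that each group always returns a vector of the same length it received.

```r
library(zoo)

# A few rows taken from the sample data, enough to cover one interior gap
# (Albania 2006) and leading NAs in both countries.
sub <- data.frame(
  Country   = c(rep("Albania", 4), rep("Algeria", 3)),
  Year      = c(2004, 2005, 2006, 2007, 2002, 2003, 2004),
  Broadband = c(NA, 272, NA, 10000, NA, 18000, 36000)
)

# ave() splits Broadband by Country, applies FUN to each piece,
# and puts the results back in the original row order.
sub$Broadband_imputed <- ave(
  sub$Broadband, sub$Country,
  FUN = function(v) na.approx(v, maxgap = Inf, rule = 2, na.rm = FALSE)
)

print(sub)
```

Calling `dplyr::mutate` with the explicit namespace prefix in the pipeline above is another way to rule out masking.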