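Imputing missing values using ratios from the overall data

Time: 2018-05-28 04:59:59

Tags: r data.table

Here is what the data.table z looks like (the dput output is provided at the bottom of the question):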

> require(data.table); z
     SurveyResponseID WhereStayed        LoS   Nights
  1:          3274455  Wellington 42741.9436   0.0000
  2:          3274476      Raglan 39591.9555   0.0000
  3:          3274493    Auckland   877.0862 877.0862
  4:          3274503    Matakohe  6865.8103       NA
  5:          3274506    Auckland 81982.5017   0.0000
 ---                                                 
146:          3275696    Clevedon  2871.3504       NA
147:          3275707    Hastings   748.8108 561.6081
148:          3275708   Stratford 23785.4769   0.0000
149:          3275715     Waitomo  1600.3829   0.0000
150:          3275728 Cape Reinga 11787.2847   0.0000

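Nights has several NAs. For those rows, I want to apportion the value of LoS across the WhereStayed locations in the same proportion that those locations have in the rest of the (non-NA) data.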

For example, consider the case of SurveyResponseID == 3274528:

> z[SurveyResponseID == 3274528]
   SurveyResponseID WhereStayed      LoS Nights
1:          3274528    Auckland 20113.82     NA
2:          3274528    Hamilton 20113.82     NA
3:          3274528     Rotorua 20113.82     NA

Now, in the complete data, this is the distribution of Nights for Auckland, Rotorua and Hamilton:

> z[WhereStayed %in% c('Rotorua', 'Hamilton', 'Auckland') & !is.na(Nights), .(Nights = sum(Nights)), by = WhereStayed]
   WhereStayed   Nights
1:    Auckland 5019.240
2:    Hamilton 1502.824
3:     Rotorua 3271.130

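That is roughly 51.25% Auckland, 15.35% Hamilton and 33.4% Rotorua. Using these shares, I want to split 20113.82 in that ratio and assign the pieces to the three NAs of respondent 3274528.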

So, after imputing the NAs, the data would look like this (for example, 10308.802 ≈ 51.25% * 20113.82):

> z[SurveyResponseID == 3274528]
   SurveyResponseID WhereStayed      LoS    Nights
1:          3274528    Auckland 20113.82 10308.802
2:          3274528    Hamilton 20113.82  3086.585
3:          3274528     Rotorua 20113.82  6718.435
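
The arithmetic behind these three values can be checked with a short base R sketch. It assumes only the rounded location totals and the rounded LoS printed above, so the results agree with the table to display precision rather than exactly; the names `totals`, `los`, `shares` and `alloc` are made up for this illustration.

```r
# Location totals and LoS copied from the (rounded) printed output above
totals <- c(Auckland = 5019.240, Hamilton = 1502.824, Rotorua = 3271.130)
los <- 20113.82

shares <- totals / sum(totals)  # each location's share of the three-location total
alloc  <- shares * los          # LoS split in those proportions

round(100 * shares, 2)          # shares as percentages
alloc                           # imputed Nights per location
```

By construction the three allocated values add back up to the respondent's LoS.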

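I do have a solution that builds an intermediate data table and then joins it back to z, but I am not sure it is the data.table way of doing things.

Below is my long-winded approach. It works, but it is clunky.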

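# total non-NA Nights per location, then each location's share of the grand total
ratios <- z[!is.na(Nights), .(Ratio = sum(Nights)), by = .(WhereStayed)]
ratios[, Ratio:=Ratio/sum(Ratio)]
# attach the shares to z (locations without any non-NA data get Ratio = NA)
z <- ratios[z, on = 'WhereStayed']
# rescale the shares so they sum to 1 within each respondent
z[, Ratio:=Ratio/sum(Ratio), by = .(SurveyResponseID)]
# fill only the missing Nights
z[is.na(Nights), Nights:=LoS*Ratio]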

This gives the following expected output (showing only the rows where Nights was originally NA):

    SurveyResponseID    WhereStayed        LoS    Nights
 1:          3274503       Matakohe  6865.8103        NA
 2:          3274528       Auckland 20113.8224 10308.802
 3:          3274528       Hamilton 20113.8224  3086.585
 4:          3274528        Rotorua 20113.8224  6718.435
 5:          3274583       Auckland 11712.8500 11712.850
 6:          3274607  Rakino Island  1161.6147        NA
 7:          3274715      Port Levy  2312.9432        NA
 8:          3274738 Waiheke Island  3036.9614        NA
 9:          3274752       Auckland   718.4200   718.420
10:          3274752          Kumeu   718.4200     0.000
11:          3274899       Auckland 96724.3395 96724.339
12:          3275082          Orewa  2125.8577        NA
13:          3275238       Auckland  4904.1634  4904.163
14:          3275256          Kumeu  5607.1564       NaN
15:          3275309       Auckland  4319.0176  4319.018
16:          3275319       Auckland  8634.8011  8634.801
17:          3275525       Auckland 25661.6887 25661.689
18:          3275560 Waiheke Island   915.7693        NA
19:          3275560       Auckland   915.7693        NA
20:          3275696       Clevedon  2871.3504  2871.350

The missing values (NA or NaN) that remain in Nights are fine, because in those cases there is no data in z to draw the ratios from.
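
For readers wondering where the leftover NA and NaN values come from, the following base R sketch reproduces the three situations with made-up shares (the vectors `known`, `all_zero` and `unmatched` are hypothetical and not taken from z). It only illustrates how R's arithmetic behaves when shares are rescaled within one respondent.

```r
# Hypothetical shares for the locations of a single respondent
known     <- c(0.6, 0)   # second location has a total of zero nights
all_zero  <- 0           # the only location has a total of zero nights
unmatched <- c(0.6, NA)  # second location has no non-NA data anywhere

known / sum(known)          # 1 0
all_zero / sum(all_zero)    # NaN, because 0 / 0 is undefined
unmatched / sum(unmatched)  # NA NA, because sum() propagates NA
```

The first case matches rows such as the Auckland/Kumeu pair in the output, the second the lone Kumeu row, and the third the respondents whose locations never appear with a non-NA Nights value.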

Data for this question -------------------------

z <- structure(list(SurveyResponseID = c(3274455L, 3274476L, 3274493L, 
3274503L, 3274506L, 3274510L, 3274517L, 3274518L, 3274523L, 3274526L, 
3274528L, 3274528L, 3274528L, 3274532L, 3274583L, 3274594L, 3274605L, 
3274607L, 3274629L, 3274645L, 3274655L, 3274659L, 3274679L, 3274679L, 
3274692L, 3274694L, 3274700L, 3274709L, 3274715L, 3274719L, 3274726L, 
3274738L, 3274750L, 3274752L, 3274752L, 3274764L, 3274771L, 3274771L, 
3274789L, 3274800L, 3274838L, 3274839L, 3274843L, 3274866L, 3274866L, 
3274874L, 3274880L, 3274880L, 3274894L, 3274899L, 3274912L, 3274918L, 
3274923L, 3274947L, 3274966L, 3274971L, 3274979L, 3274980L, 3275003L, 
3275019L, 3275046L, 3275050L, 3275052L, 3275057L, 3275064L, 3275072L, 
3275075L, 3275079L, 3275082L, 3275085L, 3275101L, 3275102L, 3275103L, 
3275108L, 3275128L, 3275129L, 3275150L, 3275152L, 3275160L, 3275166L, 
3275170L, 3275170L, 3275174L, 3275174L, 3275210L, 3275230L, 3275238L, 
3275240L, 3275246L, 3275256L, 3275280L, 3275288L, 3275292L, 3275294L, 
3275295L, 3275304L, 3275309L, 3275319L, 3275330L, 3275344L, 3275362L, 
3275378L, 3275379L, 3275394L, 3275399L, 3275406L, 3275409L, 3275411L, 
3275411L, 3275418L, 3275436L, 3275443L, 3275454L, 3275463L, 3275465L, 
3275470L, 3275496L, 3275498L, 3275504L, 3275510L, 3275521L, 3275525L, 
3275538L, 3275544L, 3275545L, 3275546L, 3275554L, 3275555L, 3275555L, 
3275556L, 3275556L, 3275556L, 3275560L, 3275560L, 3275563L, 3275566L, 
3275569L, 3275581L, 3275604L, 3275606L, 3275626L, 3275638L, 3275683L, 
3275691L, 3275692L, 3275696L, 3275707L, 3275708L, 3275715L, 3275728L
), WhereStayed = c("Wellington", "Raglan", "Auckland", "Matakohe", 
"Auckland", "Christchurch", "Auckland", "Milton", "Dannevirke", 
"Auckland", "Auckland", "Hamilton", "Rotorua", "Twizel", "Auckland", 
"Otaki", "Greymouth", "Rakino Island", "Houhora", "Napier", "Christchurch", 
"Waipoua Forest", "Oamaru", "Dunedin", "Wellington", "Hamilton", 
"Westport", "Wellington", "Port Levy", "Lake Tekapo", "Milton", 
"Waiheke Island", "Paihia", "Auckland", "Kumeu", "Omarama", "Rotorua", 
"Tauranga", "Timaru", "Abel Tasman National Park", "Auckland", 
"Queenstown", "Warkworth", "Te Anau", "Craigieburn", "Milford Sound", 
"Nelson", "Christchurch", "Rotorua", "Auckland", "New Plymouth", 
"Christchurch", "Queenstown", "Kumeu", "Auckland", "Paparoa National Park", 
"Waiotapu", "Whangarei", "Waitomo", "Queenstown", "Auckland", 
"Queenstown", "Christchurch", "Clevedon", "Waitomo", "Christchurch", 
"Taihape", "Christchurch", "Orewa", "Rotorua", "Franz Josef", 
"Pukekohe", "Kumeu", "Tairua", "Taupo", "Queenstown", "Omarama", 
"Auckland", "Hanmer Springs", "Rotorua", "Murchison", "Queenstown", 
"Queenstown", "Milford Sound", "Auckland", "Paparoa National Park", 
"Auckland", "Cromwell", "Queenstown", "Kumeu", "Clevedon", "Wellington", 
"Oamaru", "Queenstown", "Endeavour Inlet", "Blenheim", "Auckland", 
"Auckland", "Wellington", "Wanaka", "Masterton", "Whakapapa Village", 
"Tairua", "Rotorua", "Cape Kidnappers", "Waihua", "Arrowtown", 
"Cape Reinga", "Snells Beach", "Auckland", "Wellington", "Dunedin", 
"Auckland", "Taupo", "Abel Tasman National Park", "Dunedin", 
"Te Anau", "Christchurch", "Paihia", "Dunedin", "Hamilton", "Auckland", 
"Matamata", "Wanaka", "Catlins", "Paihia", "Franz Josef", "Taupo", 
"Kaikoura", "Westport", "Heaphy Track", "Piha", "Waiheke Island", 
"Auckland", "Wellington", "Whangamata", "Wanaka", "Westport", 
"Fiordland National Park", "Taupo", "Christchurch", "Te Anau", 
"Wellington", "Rotorua", "Marlborough", "Clevedon", "Hastings", 
"Stratford", "Waitomo", "Cape Reinga"), LoS = c(42741.9436047755, 
39591.9555163287, 877.08616280446, 6865.81028982635, 81982.5016525796, 
41375.3053535933, 4949.00343037598, 13643.8378966971, 1818.04165680688, 
7911.06178019024, 20113.8223823246, 20113.8223823246, 20113.8223823246, 
4297.21264743424, 11712.8500000521, 14342.9323259751, 1046.42962365774, 
1161.61465947518, 26684.8013647668, 2159.85594913809, 12382.5291370991, 
3572.88522911463, 3267.58643173956, 3267.58643173956, 9055.02741317069, 
42964.024708285, 62527.1602217821, 799.215837399333, 2312.9432017275, 
17807.880584828, 3684.55279910826, 3036.96143529467, 2095.19366998327, 
718.419976697589, 718.419976697589, 1299.69196347729, 56914.2840041613, 
56914.2840041613, 13328.4852202518, 5404.91247034716, 2522.48422126056, 
6165.64136973517, 9531.97012687062, 3894.39120716227, 3894.39120716227, 
2543.46846269262, 3414.14874750348, 3414.14874750348, 3771.30561388102, 
96724.3394654342, 3583.27705777555, 3041.13854297752, 3368.50460565427, 
3158.18811352136, 3904.66470252172, 5862.90633463616, 2882.83911001206, 
11805.2297665087, 6402.08709024943, 5186.94312706125, 870.69199642505, 
10091.1420543283, 8369.774757932, 7985.40888579288, 6926.3302645866, 
4420.06917925033, 1726.86768006798, 3974.48164722869, 2125.85771144444, 
4736.76735216895, 14504.7530311797, 62467.3075924298, 632.428436718402, 
6645.29389114695, 2241.80914051178, 1003.1560691685, 3134.88061131533, 
3604.1357395957, 48790.3266929933, 2098.82030322716, 3945.49519922237, 
3945.49519922237, 2136.34311305016, 2136.34311305016, 456.440663951212, 
10692.5752772267, 4904.16336515106, 10440.7991489425, 8828.17020986572, 
5607.15637428966, 4374.48421791468, 23277.4964101353, 3380.0999904256, 
1255.85228651154, 12561.9210632003, 7779.33569261148, 4319.01757077778, 
8634.80105492512, 12844.081196906, 3666.71285119098, 4176.94496342972, 
3288.20886332444, 2937.47178044397, 10205.4005090231, 19213.3721518298, 
8527.86375947078, 10195.2603554514, 3735.66582375512, 3735.66582375512, 
946.998025480878, 5279.64787567089, 10608.0756829274, 6242.27906140245, 
5455.41709954626, 1779.0727991838, 6029.46747996311, 4385.52398444791, 
14686.4890994835, 4171.39583798557, 2475.27432897754, 3005.64728199526, 
25661.6887253572, 11185.9596078473, 3539.88530105119, 13857.1961646826, 
3799.52953818341, 4053.93637885706, 3771.87058713216, 3771.87058713216, 
26410.8270985288, 26410.8270985288, 26410.8270985288, 915.769260388995, 
915.769260388995, 3294.46869510517, 4859.6269254318, 1968.91705023579, 
547.139652678248, 4224.21312757923, 11692.2356812747, 712.516366875341, 
9217.08214243521, 1265.12928478973, 5665.77537103692, 14824.4623882922, 
2871.35038838803, 748.810764275115, 23785.4768813912, 1600.38293737054, 
11787.2847424015), Nights = c(0, 0, 877.08616280446, NA, 0, 40170.1993724207, 
0, 2842.46622847856, 303.006942801147, 1483.32408378567, NA, 
NA, NA, 0, NA, 0, 74.7449731184097, NA, 0, 479.967988697354, 
0, 0, 136.149434655815, 136.149434655815, 0, 0, 0, 72.6559852181211, 
NA, 0, 0, NA, 0, NA, NA, 99.9763048828681, 503.666230125321, 
503.666230125321, 416.515163132868, 0, 360.354888751508, 362.68478645501, 
0, 0, 0, 0, 512.122312125522, 1877.78181112691, 179.585981613382, 
NA, 275.636696751966, 0, 748.556579034282, 0, 433.851633613524, 
279.186015935055, 0, 380.813863435765, 0, 357.720215659397, 870.69199642505, 
630.696378395519, 697.481229827667, 0, 0, 1262.87690835724, 90.8877726351567, 
722.633026768853, NA, 338.340525154925, 0, 54548.916489164, 0, 
4651.70572380286, 104.270192581943, 154.331702949, 174.160033961963, 
300.344644966308, 0, 0, 219.194177734576, 219.194177734576, 0, 
0, 0, 0, NA, 0, 8828.17020986572, NA, 4374.48421791468, 705.37867909501, 
160.957142401219, 179.407469501649, 0, 0, NA, NA, 856.272079793735, 
0, 219.839208601564, 102.756526978889, 267.04288913127, 833.093919103928, 
0, 275.092379337767, 0, 0, 81.2101266033722, 0, 310.567522098288, 
1811.13487269492, 693.58656237805, 363.694473303084, 0, 415.825343445732, 
230.817051813048, 0, 641.753205843934, 190.405717613657, 1502.82364099763, 
NA, 0, 307.816113134886, 6928.59808234131, 0, 0, 377.187058713216, 
377.187058713216, 614.205281361134, 1228.41056272227, 0, NA, 
NA, 3294.46869510517, 0, 0, 39.081403762732, 0, 6820.47081407691, 
712.516366875341, 801.485403690018, 0, 1416.44384275923, 658.864995035208, 
NA, 561.608073206336, 0, 0, 0)), row.names = c(NA, -150L), class = c("data.table", 
"data.frame"), index = structure(integer(0), "`__SurveyResponseID`" = integer(0)))

1 answer:

Answer 0 (score: 3)

I would create a reference table with the sums of Nights per WhereStayed, and then run a join while calculating the new values, e.g.

## reference table with the sums
ref <- z[!is.na(Nights), .(Nights = sum(Nights)), by = WhereStayed]

## join z with ref
z[is.na(Nights), # join only where `Nights` are NAs
  Nights := ref[.SD, Nights / sum(Nights) * LoS, # Calculate the formula per join
                on = .(WhereStayed)], # join condition
  by = SurveyResponseID] # run this by `SurveyResponseID`

## Validation
z[SurveyResponseID == 3274528]
#    SurveyResponseID WhereStayed      LoS    Nights
# 1:          3274528    Auckland 20113.82 10308.802
# 2:          3274528    Hamilton 20113.82  3086.585
# 3:          3274528     Rotorua 20113.82  6718.435
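
To try the same pattern without loading the full dput, here is a minimal sketch on a toy table. The table `toy` and its columns `id` and `place` are invented for illustration; only the shape of the grouped join mirrors the code above, and it assumes the data.table package is installed.

```r
library(data.table)

# Toy data (hypothetical values, not taken from the question's data set)
toy <- data.table(
  id     = c(1L, 2L, 3L, 3L, 4L),
  place  = c("A", "B", "A", "B", "C"),
  LoS    = c(10, 30, 100, 100, 50),
  Nights = c(10, 30, NA, NA, NA)
)

# Reference table: total known Nights per place
ref <- toy[!is.na(Nights), .(Nights = sum(Nights)), by = place]

# For each id with missing Nights, look up the place totals in ref
# and split LoS in proportion to them
toy[is.na(Nights),
    Nights := ref[.SD, Nights / sum(Nights) * LoS, on = .(place)],
    by = id]

toy
```

In this toy case id 3 has LoS 100 split 25/75 between A and B (their known totals are 10 and 30), while id 4 stays NA because place C has no known Nights to draw from.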