Question

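I have some study sites where I collected data, and nearby weather stations that provide temperature and precipitation information. I want to pair the daily data from each study site with the weather information from the closest weather station. I think this needs a two-step process: first select the weather station closest to each study site, then create a new variable from that station's weather data.

Here is a snapshot of my data: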

# study sites
site <- rep(LETTERS[1:3], 5)
siteLat <- rep(c(41, 42, 44), 5)
siteLon <- rep(c(68, 62, 63), 5)
siteDate <- rep(1:5, 3)
# build the data frame directly: cbind() would coerce every column to character
dfSites <- data.frame(site, siteLat, siteLon, siteDate)

# weather stations
station <- rep(letters[1:3], 5)
stationLat <- rep(c(40, 43, 45), 5)
stationLon <- rep(c(67, 61, 64), 5)
stationDate <- rep(1:5, 3)
temp <- sample(10:20, 15, replace=TRUE)  # random example temperatures
dfStation <- data.frame(station, stationLat, stationLon, stationDate, temp)

I tried to use this line to determine which station is closest, but I only get a single row of distances.

distVincentyEllipsoid(df2[c("recvLon", "recvLat")], weather[c("lon", "lat")])

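Once all the distances are calculated, I am not sure what to do next, but I think I need something that selects the closest station and matches the dates. This is the best I have come up with: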

dfSites %>% 
    mutate(closestStation = ???,
           temp1 = temp[station == closestStation & stationDate == siteDate])

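The end result should be my study-site data frame with an additional column holding the temperature from the closest weather station.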

Answer 1

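I think distVincentyEllipsoid(p1, p2, ...) is trying to find the distance between the first point of p1 and the first point of p2, the second point of p1 and the second point of p2, and so on. You need to expand in the direction of "first in p1 against all in p2, then second in p1 against all in p2", etc.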

Adapting your code to use dfSites and dfStation (instead of df2 / weather), the following will work for you. (I am going to remove one of the stations with dfStation[-1,...] just to clearly identify which dimension represents sites and which represents stations.)

library(geosphere)

alldists <- sapply(seq_len(nrow(dfSites)), function(i) {
  distVincentyEllipsoid(dfSites[i,c("siteLon","siteLat")],
                        dfStation[-1,c("stationLon","stationLat")])
})
alldists
#           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
#  [1,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
#  [2,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#  [3,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
#  [4,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
#  [5,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#  [6,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
#  [7,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
#  [8,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#  [9,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
# [10,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
# [11,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
# [12,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
# [13,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
# [14,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#           [,9]    [,10]    [,11]    [,12]    [,13]    [,14]    [,15]
#  [1,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
#  [2,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
#  [3,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
#  [4,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
#  [5,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
#  [6,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
#  [7,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
#  [8,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
#  [9,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
# [10,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
# [11,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
# [12,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
# [13,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
# [14,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2

(Since we have 14 rows, each row is one of your stations and each column is one of your site records. You should not use the [-1,] indexing yourself; it is only there to show which dimension is rows and which is columns.) We know that the distance between the second row of dfStation[-1,] and site A is 481351.6 meters (first column, second row).
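To make the element-wise versus all-pairs distinction concrete, here is a minimal sketch. It assumes the geosphere package is installed and uses two small illustrative matrices (`sites` and `stations` are hypothetical names holding one row per location, not the question's data frames). `geosphere::distm()` is used here as one way to get the full distance matrix in a single call; it is an alternative to the `sapply()` loop above, not what the answer itself used.

```r
library(geosphere)

# Illustrative coordinates, one row per location, in (lon, lat) order
sites    <- cbind(lon = c(68, 62, 63), lat = c(41, 42, 44))
stations <- cbind(lon = c(67, 61, 64), lat = c(40, 43, 45))

# Element-wise: site 1 vs station 1, site 2 vs station 2, site 3 vs station 3
pairwise <- distVincentyEllipsoid(sites, stations)
length(pairwise)   # one distance per pair of rows

# All pairs: rows correspond to sites, columns to stations
allpairs <- distm(sites, stations, fun = distVincentyEllipsoid)
dim(allpairs)      # number of sites by number of stations
```

The diagonal of `allpairs` should hold the same values as `pairwise`, which is why the original call returned only a single row of distances.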

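From here, find the minimum of each column:

apply(alldists, 2, which.min)
#  [1] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2

This suggests that the station closest to site A is the third row of dfStation[-1,] (which.min returns the first minimum and does not indicate ties).

Now, dfStation[apply(alldists, 2, which.min),] gives you 15 rows of station data that can easily be cbind-ed or otherwise combined with dfSites. (For the indices to line up with the rows of dfStation, compute alldists against the full dfStation, without the [-1,].)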
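The question also asks for the temperature on the matching date, which the index lookup alone does not give. The following is a rough base-R sketch of both steps under some assumptions: the example data is rebuilt with numeric columns, `set.seed()` is added only for reproducibility, and `haversine_m()` is a hypothetical helper (a spherical great-circle distance) standing in for `geosphere::distVincentyEllipsoid()` so the block runs without extra packages. The names `siteLoc`, `stationLoc`, `closestStation` and `result` are illustrative.

```r
# Example data rebuilt with numeric columns (assumption: same values as the question)
dfSites <- data.frame(
  site     = rep(LETTERS[1:3], 5),
  siteLat  = rep(c(41, 42, 44), 5),
  siteLon  = rep(c(68, 62, 63), 5),
  siteDate = rep(1:5, 3)
)
set.seed(1)  # only so the sampled temperatures are reproducible
dfStation <- data.frame(
  station     = rep(letters[1:3], 5),
  stationLat  = rep(c(40, 43, 45), 5),
  stationLon  = rep(c(67, 61, 64), 5),
  stationDate = rep(1:5, 3),
  temp        = sample(10:20, 15, replace = TRUE)
)

# Hypothetical helper: haversine distance in metres on a sphere of radius r,
# a simpler stand-in for an ellipsoidal distance function
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Distances depend only on location, so keep one row per site and per station
siteLoc    <- unique(dfSites[, c("site", "siteLon", "siteLat")])
stationLoc <- unique(dfStation[, c("station", "stationLon", "stationLat")])

# For each site, the row index of the nearest station (first minimum on ties)
nearest <- sapply(seq_len(nrow(siteLoc)), function(i) {
  which.min(haversine_m(siteLoc$siteLon[i], siteLoc$siteLat[i],
                        stationLoc$stationLon, stationLoc$stationLat))
})
siteLoc$closestStation <- stationLoc$station[nearest]

# Step 1: attach the nearest station to every site record
result <- merge(dfSites, siteLoc[, c("site", "closestStation")], by = "site")

# Step 2: pull in temp where both the station and the date match
result <- merge(result, dfStation[, c("station", "stationDate", "temp")],
                by.x = c("closestStation", "siteDate"),
                by.y = c("station", "stationDate"),
                all.x = TRUE)
```

Using `all.x = TRUE` keeps site records even when the nearest station has no reading for that date, in which case `temp` is `NA`. Computing distances on the unique locations rather than on every daily record also avoids repeating the same calculation for each date.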

dplyr option:

library(dplyr)

dfSites %>%
  mutate(
    station_i = purrr::map2_int(
      siteLon, siteLat,   # geosphere expects (longitude, latitude) order
      ~ which.min(geosphere::distVincentyEllipsoid(
          cbind(.x,.y), dfStation[-1,c("stationLon","stationLat")]))
      ),
    # station_i indexes dfStation[-1,] here; drop the [-1,] in real use
    station = as.character(dfStation$station)[ station_i ]
  )
#    site siteLat siteLon siteDate station_i station
# 1     A      41      68        1         3       c
# 2     B      42      62        2         1       a
# 3     C      44      63        3         2       b
# 4     A      41      68        4         3       c
# 5     B      42      62        5         1       a
# 6     C      44      63        1         2       b
# 7     A      41      68        2         3       c
# 8     B      42      62        3         1       a
# 9     C      44      63        4         2       b
# 10    A      41      68        5         3       c
# 11    B      42      62        1         1       a
# 12    C      44      63        2         2       b
# 13    A      41      68        3         3       c
# 14    B      42      62        4         1       a
# 15    C      44      63        5         2       b

You can see a slight (10-15%) speed improvement by doing an outer product of them:

outer(seq_len(nrow(dfSites)), seq_len(nrow(dfStation)),
      function(i,j) geosphere::distVincentyEllipsoid(
        dfSites[i,c("siteLon","siteLat")],
        dfStation[j,c("stationLon","stationLat")]))

This also returns an n x m matrix (one row per site record, one column per station row), which you would then apply(...) through to get the closest index. (I was hoping for more of a performance boost, since distVincentyEllipsoid is only called once...)
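Because the orientation of the matrix decides which margin to pass to `apply()`, a toy sketch may help. It uses made-up one-dimensional positions (`sites` and `stations` are hypothetical vectors, and absolute difference stands in for a geographic distance) purely to show how `outer()` lays out its result.

```r
# Toy positions on a line; abs() difference stands in for a real distance
sites    <- c(10, 20)
stations <- c(11, 19, 30)

# outer() passes whole index vectors to the function, so it must be vectorized;
# the first argument defines the rows, the second defines the columns
d <- outer(seq_along(sites), seq_along(stations),
           function(i, j) abs(sites[i] - stations[j]))
dim(d)  # 2 rows (sites) by 3 columns (stations)

# One nearest-station index per site, so apply over rows (margin 1)
nearest <- apply(d, 1, which.min)
nearest  # 1 2
```

With the `sapply()` approach earlier in the answer the layout is transposed (stations in rows, sites in columns), which is why that version applies `which.min` over margin 2 instead.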

Identify the closest station and select another variable from that location
