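I am trying to extract values from data that must satisfy several conditions. The data come as one file per month, and the observation times in a file are not continuous. The data look like this: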
measure value
1 Station identifier WAML
2 Station number 97072
3 Observation time 150101/0000
...
27 Mean mixed layer potential temperature 298.68
28 Mean mixed layer mixing ratio 16.77
29 1000 hPa to 500 hPa thickness 5773.00
30 Precipitable water [mm] for entire sounding 55.86
31 Station identifier WAML
32 Station number 97072
33 Observation time 150109/1200
...
57 Mean mixed layer potential temperature 300.78
58 Mean mixed layer mixing ratio 16.29
59 1000 hPa to 500 hPa thickness 5784.00
60 Precipitable water [mm] for entire sounding 52.46
61 Station identifier WAML
62 Station number 97072
63 Observation time 150110/0000
...
87 Mean mixed layer potential temperature 297.48
88 Mean mixed layer mixing ratio 16.55
89 1000 hPa to 500 hPa thickness 5760.00
90 Station identifier WAML
91 Station number 97072
92 Observation time 150110/1200
...
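I want to filter the data by "Observation time" and "Precipitable water [mm] for entire sounding" so that I can get the values. In some cases, however, an observation has no precipitable water data, only the observation time together with other parameters.
I tried using: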
df1 <- dplyr::filter(obs.tpw, grepl(paste(c("Observation time", "Precipitable water [mm] for entire sounding"), collapse = "&"), paste(measure, value, sep = "_")))
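But that returns no data.
How can I keep only the observation time and the precipitable water parameter, and then put them in order? The observation time value has the form 'date'/'time'; for example, 150101/0000 is (year)(month)(day)/(hour)(minute). The data I get are not sorted by date and hour. For example, the first observation time is 150101/0000 and the second is 150109/1200, but the second should be 150101/1200, because there are two observations per day (0000 and 1200).
The final data I want look like this: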
measure value
1 Observation time 150101/0000
2 Precipitable water [mm] for entire sounding 55.86
3 Observation time 150101/1200
4 Precipitable water [mm] for entire sounding 52.46
5 Observation time 150102/0000
6 Precipitable water [mm] for entire sounding 61.15
7 Observation time 150102/1200
8 Precipitable water [mm] for entire sounding 55.93
9 Observation time 150103/0000
10 Precipitable water [mm] for entire sounding 52.25
11 Observation time 150103/1200
12 Precipitable water [mm] for entire sounding 61.48
13 Observation time 150104/0000
14 Precipitable water [mm] for entire sounding NA
15 Observation time 150104/1200
16 Precipitable water [mm] for entire sounding 61.92
17 Observation time 150105/0000
18 Precipitable water [mm] for entire sounding NA
19 Observation time 150105/1200
20 Precipitable water [mm] for entire sounding 57.42
Answer 0 (score: 2)
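I made the following assumptions, because these points were not clear from your question above (if they are wrong, I will revise the answer as needed):
- The combination of Station identifier, Station number, and Observation time represents a unique observation.
- I know nothing about the date-time format used in Observation time, but I guess it is something like 'date'/'time', where 'date' is an integer sequence referring to the number of days after a particular reference date.
First, please try to include a reproducible dataset in questions like this, or link to publicly available data: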
# Create Reproducible Dataset ---------------------------------------------
measure <- c("Station identifier",
"Station number",
"Observation time", "Mean mixed layer potential temperature",
"Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness",
"Precipitable water [mm] for entire sounding", "Station identifier",
"Station number", "Observation time",
"Mean mixed layer potential temperature",
"Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness",
"Precipitable water [mm] for entire sounding", "Station identifier",
"Station number", "Observation time",
"Mean mixed layer potential temperature",
"Mean mixed layer mixing ratio",
"1000 hPa to 500 hPa thickness", "Station identifier",
"Station number", "Observation time")
value <- c("WAML", "97072", "150101/0000", "298.68", "16.77", "5773.00", "55.86",
"WAML", "97072", "150109/1200", "300.78", "16.29", "5784.00", "52.46",
"WAML", "97072", "150110/0000", "297.48", "16.55", "5760.00", "WAML",
"97072", "150110/1200")
df <- data.frame(measure = measure, value = value, stringsAsFactors = FALSE)
Now to your question:
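Before the solution, a note on why the `grepl()` call in the question matches nothing: `paste(..., collapse = "&")` builds one long pattern, and in a regular expression `&` is an ordinary character (not a logical "and") while `[mm]` is a character class that matches a single `m`. The sketch below uses base R only, with a hypothetical three-element `measure` vector standing in for the real column, to show the mismatch and an exact-match alternative using `%in%`.

```r
# Hypothetical stand-in for the `measure` column in the question.
measure <- c("Station identifier", "Observation time",
             "Precipitable water [mm] for entire sounding")

wanted <- c("Observation time",
            "Precipitable water [mm] for entire sounding")

# The pattern built in the question: "&" is matched literally and "[mm]"
# is read as a character class, so no single label can match it.
joined <- paste(wanted, collapse = "&")
print(grepl(joined, measure))

# Exact matching against the set of wanted labels avoids regex entirely.
print(measure %in% wanted)
```

Exact matching is the approach the `dplyr::filter(measure %in% ...)` step in the solution relies on.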
# Solution ----------------------------------------------------------------
library(dplyr)  # also provides the %>% pipe
library(tidyr)

# Create index of rows where `measure == "Station identifier"`;
# each such row marks the start of a new sounding
idx <- which(df$measure == "Station identifier")
df %>%
  # Create unique identifier for each sounding block
  dplyr::mutate(station_id = cut(seq_len(nrow(df)),
                                 c(idx, nrow(df)),
                                 right = FALSE,
                                 include.lowest = TRUE)) %>%
  dplyr::filter(measure %in% c("Observation time",
                               "Precipitable water [mm] for entire sounding")) %>%
  # Turn each value in measure into a new column
  tidyr::pivot_wider(names_from = "measure", values_from = "value") %>%
  # Inelegant way of sorting by date and time:
  # "150109/1200" becomes the number 150109.12
  dplyr::mutate(ot = as.numeric(sub("/", ".", `Observation time`, fixed = TRUE))) %>%
  dplyr::arrange(ot) %>%
  dplyr::select(-ot) %>%
  # Drops soundings with no precipitable water value;
  # remove this step to keep them as NA
  tidyr::drop_na()
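The pipeline above ends with `drop_na()`, which removes any sounding that has no precipitable water row, whereas the desired output in the question keeps those soundings with `NA`. As a minimal sketch of one alternative (not the only way to do it), the base R code below keeps them and sorts on a parsed date-time. The miniature `df`, the helper `pick()`, and the output column names `obs_time` and `pw_mm` are hypothetical, and the `%y%m%d/%H%M` format with UTC is an assumption based on the layout described in the question.

```r
# Hypothetical miniature of the data; values copied from the question's sample.
df <- data.frame(
  measure = c("Station identifier", "Observation time",
              "Precipitable water [mm] for entire sounding",
              "Station identifier", "Observation time",
              "Station identifier", "Observation time",
              "Precipitable water [mm] for entire sounding"),
  value = c("WAML", "150109/1200", "52.46",
            "WAML", "150110/0000",
            "WAML", "150101/0000", "55.86"),
  stringsAsFactors = FALSE
)

pw_name <- "Precipitable water [mm] for entire sounding"

# Each "Station identifier" row starts a new sounding, so a running count
# of those rows labels every row with its sounding number.
block <- cumsum(df$measure == "Station identifier")

# Hypothetical helper: one value per sounding, NA when the measure is absent.
pick <- function(name) {
  unname(vapply(split(df, block), function(b) {
    hit <- b$value[b$measure == name]
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1)))
}

result <- data.frame(
  obs_time = pick("Observation time"),
  pw_mm = as.numeric(pick(pw_name)),
  stringsAsFactors = FALSE
)

# Assumption: times follow yymmdd/hhmm and can be read as UTC.
parsed <- as.POSIXct(result$obs_time, format = "%y%m%d/%H%M", tz = "UTC")
result <- result[order(parsed), ]
rownames(result) <- NULL
print(result)
```

This yields one row per sounding rather than the alternating two-row layout shown in the question; reshaping back to that layout is possible, but the wide form is usually easier to work with afterwards.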
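Finally, I would like to point out that although you can parse and analyze these data with the tidyverse family of packages, if your field of study requires frequent work with geospatial, spatio-temporal, or atmospheric data, there already seem to be plenty of R packages dedicated to that purpose. I have absolutely no experience in this area, but from a brief search the spacetime package on CRAN looks promising, as it appears able to handle data in this format. Another useful resource is the following primer by Edzer Pebesma.
I hope this is useful.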