Question

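I am trying to extract values from data subject to several conditions. The data come as one file per month, and the observation times are not consecutive. The data look like this: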

                                        measure       value
1                            Station identifier        WAML
2                                Station number       97072
3                              Observation time 150101/0000
...
27       Mean mixed layer potential temperature      298.68
28                Mean mixed layer mixing ratio       16.77
29                1000 hPa to 500 hPa thickness     5773.00
30  Precipitable water [mm] for entire sounding       55.86
31                           Station identifier        WAML
32                               Station number       97072
33                             Observation time 150109/1200
...
57       Mean mixed layer potential temperature      300.78
58                Mean mixed layer mixing ratio       16.29
59                1000 hPa to 500 hPa thickness     5784.00
60  Precipitable water [mm] for entire sounding       52.46
61                           Station identifier        WAML
62                               Station number       97072
63                             Observation time 150110/0000
...
87       Mean mixed layer potential temperature      297.48
88                Mean mixed layer mixing ratio       16.55
89                1000 hPa to 500 hPa thickness     5760.00
90                           Station identifier        WAML
91                               Station number       97072
92                             Observation time 150110/1200
...

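I want to filter the data by "Observation time" and "Precipitable water [mm] for entire sounding" so that I can get those values. In some cases, however, an observation has no precipitable water value, only the observation time along with the other parameters.

I tried: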

df1 <-  dplyr::filter(obs.tpw, grepl(paste(c("Observation time", "Precipitable water [mm] for entire sounding"), collapse = "&"), paste(measure, value, sep = "_")))

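but that returns no data.

How can I keep only the observation time and precipitable water rows, and then sort them chronologically? The observation time has the form 'date'/'time', where 150101 is (year)(month)(day) and the part after the slash is (hour)(minute). The data I receive are not sorted by date and hour. For example, the first observation time is 150101/0000 and the second is 150109/1200, whereas the second should be 150101/1200, because there are two observations per day (0000 and 1200).

The final data I want look like this: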

                                       measure       value
1                             Observation time 150101/0000
2  Precipitable water [mm] for entire sounding       55.86
3                             Observation time 150101/1200
4  Precipitable water [mm] for entire sounding       52.46
5                             Observation time 150102/0000
6  Precipitable water [mm] for entire sounding       61.15
7                             Observation time 150102/1200
8  Precipitable water [mm] for entire sounding       55.93
9                             Observation time 150103/0000
10 Precipitable water [mm] for entire sounding       52.25
11                            Observation time 150103/1200
12 Precipitable water [mm] for entire sounding       61.48
13                            Observation time 150104/0000
14 Precipitable water [mm] for entire sounding          NA
15                            Observation time 150104/1200
16 Precipitable water [mm] for entire sounding       61.92
17                            Observation time 150105/0000
18 Precipitable water [mm] for entire sounding          NA
19                            Observation time 150105/1200
20 Precipitable water [mm] for entire sounding       57.42

Answer 1

I have made the following assumptions, which were not clear from your question above (if they are incorrect, I will amend the answer as needed):

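A unique observation is identified by the combination of Station identifier, Station number and Observation time.
Every observation includes these three identifiers, and they always appear in the same order immediately before the data belonging to that observation.
I do not know the date-time format used in Observation time, but I am guessing it is something like 'date'/'time', where 'date' is an integer serial number giving the number of days since some reference date.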
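
The third assumption is only a guess. The question itself describes the value as (year)(month)(day)/(hour)(minute), so under that reading the strings can be parsed into real date-times. The following is a minimal base R sketch; the UTC time zone and the sample vector `obs_time` are assumptions made for illustration, not something stated in the question.

```r
# Assumption: "Observation time" is encoded as yymmdd/HHMM
# (e.g. "150109/1200"), and the times are treated as UTC.
obs_time <- c("150109/1200", "150101/0000", "150110/0000")

# Parse the strings into POSIXct date-times
parsed <- as.POSIXct(obs_time, format = "%y%m%d/%H%M", tz = "UTC")

# Ordering by the parsed values gives chronological order
sorted_time <- obs_time[order(parsed)]
sorted_time
```

Sorting on a parsed date-time avoids relying on the numeric trick of replacing the slash with a decimal point, and it makes later date arithmetic (for example, finding missing 12-hourly slots) possible.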

First, please try to include a reproducible dataset in questions like this, or link to publicly available data:

# Create Reproducible Dataset ---------------------------------------------
measure <- c("Station identifier", 
             "Station number", 
             "Observation time", "Mean mixed layer potential temperature", 
             "Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness",
             "Precipitable water [mm] for entire sounding", "Station identifier", 
             "Station number", "Observation time", 
             "Mean mixed layer potential temperature",
             "Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness", 
             "Precipitable water [mm] for entire sounding", "Station identifier", 
             "Station number", "Observation time", 
             "Mean mixed layer potential temperature", 
             "Mean mixed layer mixing ratio", 
             "1000 hPa to 500 hPa thickness", "Station identifier", 
             "Station number", "Observation time")
value <- c("WAML", "97072", "150101/0000", "298.68", "16.77", "5773.00", "55.86", 
           "WAML", "97072", "150109/1200", "300.78", "16.29", "5784.00", "52.46", 
           "WAML", "97072", "150110/0000", "297.48", "16.55", "5760.00", "WAML", 
           "97072", "150110/1200")
df <- data.frame(measure = measure, value = value, stringsAsFactors = FALSE)

Now, on to your question:
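
The `grepl()` call in the question returns nothing because the two labels are pasted together with `"&"` into a single regular expression. In a regex, `&` is just a literal character and `[mm]` is a character class that matches a single `m`, so no row can match. A small sketch of the problem and of exact matching with `%in%` follows; `sample_df` and `wanted` are illustrative names, not objects from the question.

```r
wanted <- c("Observation time",
            "Precipitable water [mm] for entire sounding")

# Small illustrative sample in the same measure/value layout
sample_df <- data.frame(
  measure = c("Station identifier", "Observation time",
              "Precipitable water [mm] for entire sounding"),
  value   = c("WAML", "150101/0000", "55.86"),
  stringsAsFactors = FALSE
)

# The original pattern is one regex: "&" is a literal character and
# "[mm]" is a character class, so no row matches it
pattern <- paste(wanted, collapse = "&")
any_match <- any(grepl(pattern,
                       paste(sample_df$measure, sample_df$value, sep = "_")))

# Exact matching with %in% avoids regular expressions altogether
kept <- sample_df[sample_df$measure %in% wanted, ]
kept
```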

# Solution ----------------------------------------------------------------
library(dplyr)  # provides the %>% pipe used below

# Create index of rows where `measure == "Station identifier"`
idx <- which(df$measure == "Station identifier")

df %>% 
    # Label each observation block (rows from one "Station identifier"
    # row up to, but not including, the next one)
    dplyr::mutate(station_id = cut(seq_len(nrow(df)), 
                                   c(idx, nrow(df)),
                                   right = FALSE, 
                                   include.lowest = TRUE)) %>% 
    dplyr::filter(measure %in% c("Observation time", 
                                 "Precipitable water [mm] for entire sounding")) %>% 
    # Turn each value in measure into a new column
    tidyr::pivot_wider(names_from = "measure", values_from = "value") %>% 
    # Inelegant way of sorting by date and time: "150101/0000" -> 150101.0000
    dplyr::mutate(ot = as.numeric(sub("/", ".", `Observation time`, fixed = TRUE))) %>% 
    dplyr::arrange(ot) %>% 
    dplyr::select(-ot) %>% 
    # Drops observations that have no precipitable water value;
    # omit this step to keep them with NA instead
    tidyr::drop_na()
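
The pipeline above ends with `drop_na()`, which removes observations that lack a precipitable water value, whereas the desired output in the question keeps them as `NA` in the original two-column layout. One possible way to get that shape is sketched below in base R, so that it runs without extra packages. It assumes that every observation block starts with a "Station identifier" row and contains exactly one "Observation time" row, and that times are encoded as yymmdd/HHMM; the `soundings` data frame is a shortened, made-up sample.

```r
time_label <- "Observation time"
pw_label   <- "Precipitable water [mm] for entire sounding"

# Shortened illustrative sample; the second observation has no precipitable water
soundings <- data.frame(
  measure = c("Station identifier", time_label, pw_label,
              "Station identifier", time_label,
              "Station identifier", time_label, pw_label),
  value   = c("WAML", "150109/1200", "52.46",
              "WAML", "150110/0000",
              "WAML", "150101/0000", "55.86"),
  stringsAsFactors = FALSE
)

# Number the observation blocks: each "Station identifier" starts a new one
obs_id <- cumsum(soundings$measure == "Station identifier")

# One observation time per block (assumption stated above)
obs_time <- soundings$value[soundings$measure == time_label]

# Precipitable water per block, NA where the row is missing
is_pw <- soundings$measure == pw_label
pw <- rep(NA_character_, max(obs_id))
pw[obs_id[is_pw]] <- soundings$value[is_pw]

# Chronological order, assuming yymmdd/HHMM
ord <- order(as.POSIXct(obs_time, format = "%y%m%d/%H%M", tz = "UTC"))

# Interleave time and precipitable water rows back into the long layout
result <- data.frame(
  measure = rep(c(time_label, pw_label), times = length(obs_time)),
  value   = as.vector(rbind(obs_time[ord], pw[ord])),
  stringsAsFactors = FALSE
)
result
```

Note that this only fills in `NA` for observations that are present in the file. Time slots that are missing altogether (for example 150101/1200 in the sample data) would still need to be added separately.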

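Finally, I would point out that although you can parse and analyse these data with tidyverse packages, if your field requires frequent work with geospatial, spatio-temporal or atmospheric data, there appear to be a good many R packages dedicated to that purpose. I have absolutely no experience in this area, but from a brief search the spacetime package on CRAN looks promising, as it appears to handle data in this kind of format. Another useful resource is an introductory primer by Edzer Pebesma.

I hope this is useful.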
