If in R

Time: 2018-12-11 21:13:44

Tags: r data.table

We have two data frames. station_data contains weather observations at the geography-day level. tavg_monthly (called prcp_monthly in the example and code below) contains the quantiles of tavg at the geography-month level. An observation in station_data that is greater than or equal to the 75% quantile or less than the 25% quantile (stored in tavg_monthly as `75%` and `25%`) indicates "extreme weather" and should be flagged TRUE. Observations are grouped by fips and month.

Example of station_data:

 structure(list(fips = c("01073", "01073", "01073", "01073", "01073", 
 "01073"), rain = c(0, 0, 0, 0, 0, 0), year = c("1980", "1980", 
 "1980", "1980", "1980", "1980"), week = c(1L, 1L, 1L, 1L, 1L, 
 1L), month = c("01", "01", "01", "01", "01", "01"), day = c("001", 
 "002", "003", "004", "005", "006"), tavg = c(3.32500010728836, 
 4.64999985694885, 7.77500009536743, 4.3125, 0, 1.86249995231628
 )), row.names = c(NA, 6L), class = "data.frame")

Example of prcp_monthly:

 structure(list(fips = c("01073", "01073", "01073", "01073", "01073", 
 "01073"), month = c("01", "02", "03", "04", "05", "06"), 
 `25%` = c(2.68333338201046, 
 4.65000009536743, 8.86249977350235, 13.8229166865349, 18.7999997138977, 
 23.7364585399628), `75%` = c(9.79999996721745, 12.1333334445953, 
 16.3260417580605, 20.1833333969116, 23.6843748092651, 26.5312495231628
 ), n = c(1116L, 1017L, 1116L, 1080L, 1116L, 1080L)), row.names = c(NA, 
 6L), class = "data.frame")

Using the following line

setDT(station_data)[, extr_tavg_monthly := station_data$tavg>=prcp_monthly$`75%` | output$tavg<=input$`25%` , by = list(fips, month)]

I get an additional column with the results, but they are inconsistent (i.e. sometimes wrong). I receive 50+ warnings of the form

In `[.data.table`(setDT(station_data), , `:=`(extr_prcp_monthly,  ...:
RHS 1 is length (greater than the size (1116) of group 25). The 
last 35868 element(s) will be discarded.

where 35868 / 12 months = 3082 (my number of geographic units) and 1116 = 36 years of data * 31 days (e.g. January) in the full dataset.

The result is:

    fips rain year week month day   tavg extr_tavg_monthly
1: 01073    0 1980    1    01 001 3.3250             FALSE
2: 01073    0 1980    1    01 002 4.6500              TRUE
3: 01073    0 1980    1    01 003 7.7750              TRUE
4: 01073    0 1980    1    01 004 4.3125              TRUE
5: 01073    0 1980    1    01 005 0.0000              TRUE
6: 01073    0 1980    1    01 006 1.8625              TRUE

Given the quartiles for month=01 in prcp_monthly above, it should be:

    fips rain year week month day   tavg extr_tavg_monthly
1: 01073    0 1980    1    01 001 3.3250             FALSE
2: 01073    0 1980    1    01 002 4.6500             FALSE
3: 01073    0 1980    1    01 003 7.7750             FALSE
4: 01073    0 1980    1    01 004 4.3125             FALSE
5: 01073    0 1980    1    01 005 0.0000              TRUE
6: 01073    0 1980    1    01 006 1.8625              TRUE

2 Answers:

Answer 0 (score: 1)

Alternatively, this can be solved with a non-equi update join:

library(data.table)
setDT(station_data)[setDT(prcp_monthly), 
             on = .(fips, month, tavg >= `25%`, tavg < `75%`), 
             extr_tavg_monthly := FALSE][
               is.na(extr_tavg_monthly), extr_tavg_monthly := TRUE][]
    fips rain year week month day   tavg extr_tavg_monthly
1: 01073    0 1980    1    01 001 3.3250             FALSE
2: 01073    0 1980    1    01 002 4.6500             FALSE
3: 01073    0 1980    1    01 003 7.7750             FALSE
4: 01073    0 1980    1    01 004 4.3125             FALSE
5: 01073    0 1980    1    01 005 0.0000              TRUE
6: 01073    0 1980    1    01 006 1.8625              TRUE

Note that no column other than extr_tavg_monthly is added to the station_data dataset. This is in contrast to this answer, which also adds the `25%` and `75%` columns to station_data.
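To make the two steps of the update join easier to follow, here is a minimal sketch on toy data copied from the question's month-01 rows. The names `station`, `quart`, and `extr` are hypothetical stand-ins for station_data, prcp_monthly, and extr_tavg_monthly, and the quartile values are rounded. After the join, rows that fall inside the quartile range hold FALSE while the others are still NA; the second step turns those NA values into TRUE.

```r
library(data.table)

# Toy stand-ins (hypothetical names) for station_data and prcp_monthly
station <- data.table(
  fips  = "01073",
  month = "01",
  day   = sprintf("%03d", 1:6),
  tavg  = c(3.325, 4.65, 7.775, 4.3125, 0, 1.8625)
)
quart <- data.table(fips = "01073", month = "01", `25%` = 2.683, `75%` = 9.8)
cols_before <- names(station)

# Step 1: rows with `25%` <= tavg < `75%` find a match and are set to FALSE
station[quart, on = .(fips, month, tavg >= `25%`, tavg < `75%`), extr := FALSE]
n_unmatched <- sum(is.na(station$extr))  # rows outside the range are still NA

# Step 2: every row that found no match is flagged as extreme
station[is.na(extr), extr := TRUE]

# Only the flag column has been added to the table
new_cols <- setdiff(names(station), cols_before)
print(station)
```

Because `:=` updates `station` by reference, the quartile columns of `quart` are used only for matching and never copied over.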

Edit

If I understand the OP's comment correctly, extr_tavg_monthly is required to be NA when tavg is missing. This can be achieved with a small modification.

# create 2nd dataset by appending an additional row containing NA
station_data2 <- rbind(setDT(station_data), station_data[.N])
station_data2[.N, `:=`(day = "007", tavg = NA)]
station_data2
    fips rain year week month day   tavg
1: 01073    0 1980    1    01 001 3.3250
2: 01073    0 1980    1    01 002 4.6500
3: 01073    0 1980    1    01 003 7.7750
4: 01073    0 1980    1    01 004 4.3125
5: 01073    0 1980    1    01 005 0.0000
6: 01073    0 1980    1    01 006 1.8625
7: 01073    0 1980    1    01 007     NA
station_data2[setDT(prcp_monthly), 
              on = .(fips, month, tavg >= `25%`, tavg < `75%`), 
              extr_tavg_monthly := FALSE][
                is.na(extr_tavg_monthly) & !is.na(tavg), extr_tavg_monthly := TRUE]
station_data2
    fips rain year week month day   tavg extr_tavg_monthly
1: 01073    0 1980    1    01 001 3.3250             FALSE
2: 01073    0 1980    1    01 002 4.6500             FALSE
3: 01073    0 1980    1    01 003 7.7750             FALSE
4: 01073    0 1980    1    01 004 4.3125             FALSE
5: 01073    0 1980    1    01 005 0.0000              TRUE
6: 01073    0 1980    1    01 006 1.8625              TRUE
7: 01073    0 1980    1    01 007     NA                NA

Answer 1 (score: 0)

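What works is merging the quartiles into station_data first, so I guess the cause was the length mismatch reported in the warning message.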

setDT(station_data)[setDT(tavg_monthly), `25%` := `25%`, on=c("fips", "month")]
setDT(station_data)[setDT(tavg_monthly), `75%` := `75%`, on=c("fips", "month")]
setDT(station_data)[, extr_tavg_monthly :=tavg>=`75%` | tavg<=`25%`, by = list(fips, month)]
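For context on the mismatch: inside `j`, an expression such as `station_data$tavg` refers to the whole column even when `by` is used, so its length does not match the group size. The merge-first approach can also be sketched in one pass. The version below is only an illustration on toy data: `station`, `quart`, `extr`, and the helper columns `q25` and `q75` are hypothetical names, and the quartile values are rounded. It copies both quartiles in a single update join, compares row by row (so no `by` is needed), and drops the helper columns afterwards.

```r
library(data.table)

# Toy stand-ins (hypothetical names); the last row has a missing tavg
station <- data.table(
  fips  = "01073",
  month = "01",
  day   = sprintf("%03d", 1:7),
  tavg  = c(3.325, 4.65, 7.775, 4.3125, 0, 1.8625, NA)
)
quart <- data.table(fips = "01073", month = "01", `25%` = 2.683, `75%` = 9.8)

# Copy both quartiles in one update join; q25 and q75 are helper columns
station[quart, on = .(fips, month), c("q25", "q75") := .(`25%`, `75%`)]

# The comparison is vectorised over rows, so grouping is not required
station[, extr := tavg >= q75 | tavg <= q25]

# Remove the helper columns so that only the flag remains
station[, c("q25", "q75") := NULL]
print(station)
```

A missing tavg propagates through the comparison, so the flag is NA for that row without any extra handling.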