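I'm trying to simplify an analysis that currently uses two SQL queries, reducing it to one. To do that, I combined the biomass data with the size-class data in a single SQL query, which created duplicates. This happens because biomass is already a sum: it is the total biomass of each taxa_name at a site, so it is a one-to-many value in my new table.
To get rid of the two SQL queries, I got the job done with two data.table operations and a final join. An alternative is to do the calculations and remove duplicates twice. Is there a way to avoid both of these using data.table alone?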
library(data.table)
# dput() output with the non-parseable .internal.selfref pointer removed
testdf <- as.data.table(structure(list(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L), abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L), biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5), size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L), site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L), taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"), lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L)), row.names = c(NA, -15L), class = "data.frame"))
# biomass
bm <- testdf
bm <- bm[, .(site = unique(site)),
by = list(spcode, taxa_name, biomass)][, totbm := sum(biomass), by = list(spcode)][!duplicated(spcode), c(1,5)]
> bm
spcode totbm
1: 10008 0.5
2: 10002 0.3
3: 10006 0.6
4: 10011 0.5
Next I compute the abundance summaries and then join the two on spcode.
# abundance
testdf <- testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode, taxa_name)]
# join
testdf[bm, on = 'spcode', bm := i.totbm]
> testdf
spcode taxa_name totabn n minlngth maxlngth bm
1: 10008 Hippoglossina stomata 85 4 20 23 0.5
2: 10002 Symphurus atricaudus 83 7 5 16 0.3
3: 10006 Microstomus pacificus 85 8 9 14 0.6
4: 10011 Parophrys vetulus 17 1 17 17 0.5
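The output of testdf above is the output I want. My other attempts rely on two !duplicated calls. In my head, I would like to be able to use [, totbm := sum(biomass), by = list(unique(site), spcode)] within the abundance calculation, but that doesn't work.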
testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)), by = list(spcode, taxa_name)][, totbm := sum(biomass), by = list(unique(site), spcode)]
Error in `[.data.table`(testdf[, .(site = (site), biomass = biomass, totabn = sum(lnXabun), : The items in the 'by' or 'keyby' list are length (3,15). Each must be length 15; the same length as there are rows in x (after subsetting if i is provided).
Alternative:
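# assumes bm still holds the raw 15-row data (i.e. right after bm <- testdf, before any aggregation)
alt <- bm[, .(site = site, taxa_name = taxa_name, biomass = biomass, totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class)),
by = list(spcode)]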
alt <- alt[!duplicated(alt, by = c("site", "spcode"))]
alt[, totbm := sum(biomass), by = list(spcode)]
alt[!duplicated(alt, by = "spcode"), c(1,3,5:9)]
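Answer 0 (score: 3)
Like I mentioned in the comments, I'm not a fan of tables with redundant data, but here is one way to solve the problem. Basically, instead of using some kind of "unique" function, add an index number within each site/taxa_name group so that every biomass value except the first in the group can be set to 0. The sum by spcode/taxa_name should then work fine. Of course, this assumes that each site/taxa_name combination corresponds to exactly one biomass value.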
testdf <- data.table(spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L, 10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1, 0.5, 0.5, 0.5),
size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L, 9L, 10L, 11L),
site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L, 910L, 910L, 910L, 910L, 910L),
taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Hippoglossina stomata", "Hippoglossina stomata", "Symphurus atricaudus", "Symphurus atricaudus", "Parophrys vetulus", "Symphurus atricaudus", "Symphurus atricaudus", "Microstomus pacificus", "Microstomus pacificus", "Microstomus pacificus"),
lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L, 9L, 40L, 22L))
testdf[, biomassIdx := 1:.N, by = c('site', 'taxa_name')]
testdf[biomassIdx > 1, biomass := 0]
testdf[, .(totabn = sum(lnXabun), n = sum(abundance), minlngth = min(size_class), maxlngth = max(size_class), bm = sum(biomass)),
by = list(spcode, taxa_name)]
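A variation on the same idea, offered only as a minimal sketch and not as the answerer's code: instead of zeroing the repeated biomass values in place, they can be dropped inside the aggregation with `!duplicated(site)`. It relies on the same assumption (one biomass value per site/taxa_name combination) and rebuilds the sample data from the question so that it runs on its own. The name `res` is chosen here for illustration.

```r
library(data.table)

# Sample data copied from the question.
testdf <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L,
             10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1,
              0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L,
                 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L,
           910L, 910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Hippoglossina stomata",
                "Hippoglossina stomata", "Symphurus atricaudus",
                "Symphurus atricaudus", "Parophrys vetulus",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Microstomus pacificus",
                "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L,
              9L, 40L, 22L)
)

# Within each species group, keep biomass only from the first row of every
# site, so each repeated per-site total is counted once.
res <- testdf[, .(totabn   = sum(lnXabun),
                  n        = sum(abundance),
                  minlngth = min(size_class),
                  maxlngth = max(size_class),
                  bm       = sum(biomass[!duplicated(site)])),
              by = .(spcode, taxa_name)]
print(res)
```

Because nothing is assigned with `:=`, the original `testdf` is left unchanged, which may be preferable if the raw rows are needed later.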
Answer 1 (score: 1)
Unless I'm missing something, you are overcomplicating this a little. Just do a distinct followed by a summary:
bm <- testdf[, .SD[1L], by = list(spcode, taxa_name, biomass, site) # distinct
][, .(totbm = sum(biomass)), by = "spcode"] # summary
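To get from this biomass table to the full output requested in the question, the distinct-then-summarise step can be combined with the abundance summary through an update join. The following is a hedged sketch, not the answerer's code: it swaps `.SD[1L]` for `unique()` on the relevant columns, and the names `abn` and `bm_tot` are illustrative.

```r
library(data.table)

# Sample data copied from the question.
testdf <- data.table(
  spcode = c(10008L, 10008L, 10002L, 10002L, 10006L, 10008L, 10008L, 10002L,
             10002L, 10011L, 10002L, 10002L, 10006L, 10006L, 10006L),
  abundance = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 2L),
  biomass = c(0.2, 0.2, 0.1, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.1,
              0.5, 0.5, 0.5),
  size_class = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 13L, 17L, 12L, 5L,
                 9L, 10L, 11L),
  site = c(907L, 907L, 907L, 907L, 907L, 914L, 914L, 914L, 914L, 914L,
           910L, 910L, 910L, 910L, 910L),
  taxa_name = c("Hippoglossina stomata", "Hippoglossina stomata",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Hippoglossina stomata",
                "Hippoglossina stomata", "Symphurus atricaudus",
                "Symphurus atricaudus", "Parophrys vetulus",
                "Symphurus atricaudus", "Symphurus atricaudus",
                "Microstomus pacificus", "Microstomus pacificus",
                "Microstomus pacificus"),
  lnXabun = c(21L, 20L, 14L, 10L, 14L, 21L, 23L, 16L, 26L, 17L, 12L, 5L,
              9L, 40L, 22L)
)

# Distinct (spcode, taxa_name, site, biomass) rows, then total biomass per species.
bm_tot <- unique(testdf[, .(spcode, taxa_name, site, biomass)])[
  , .(totbm = sum(biomass)), by = spcode]

# Abundance summary per species, as in the question.
abn <- testdf[, .(totabn   = sum(lnXabun),
                  n        = sum(abundance),
                  minlngth = min(size_class),
                  maxlngth = max(size_class)),
              by = .(spcode, taxa_name)]

# Update join: add the biomass total to the abundance table by reference.
abn[bm_tot, on = "spcode", bm := i.totbm]
print(abn)
```

This still uses two aggregations and a join, as in the question, but it avoids the `!duplicated` calls and the positional column selection.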