I have a large data.table (only five rows are shown here).
taxpath N
Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48; 57
Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8; 54
Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA; 53
Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA; 41
Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84; 41
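The first column is taxpath (phylum, class, order, family, genus, species, from left to right), and the second column is N, the number of times each taxpath occurs.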
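What I want to do is split each taxpath at the semicolons and use the first entry.
I want to count how often each phylum (the first level, so Bacteroidetes, Proteobacteria, or Planctomycetes) occurs. However, this count should be weighted by the value in column N, that is, N summed per phylum.
So my expected output is more or less this: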
phylum Nnew
Bacteroidetes 111
Proteobacteria 94
Planctomycetes 41
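Can you help me work out how to do the split within the column and, I think, a group-by weighted by column N?
(PS: Later I would also like to work with the other elements of the taxpath column, but I think it is easier to put those into a separate table.)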
Answer 0 (score: 2)
This is tagged data.table, so here's a simple data.table solution.
library(data.table)
DT[, .(Nnew = sum(N)), by = sub(";.*", "", taxpath)]
# sub Nnew
# 1: Bacteroidetes 111
# 2: Proteobacteria 94
# 3: Planctomycetes 41
We basically summed N while extracting the first part of taxpath on the fly in the by statement. Because the grouping expression is unnamed, the result column is labelled sub.
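If a more descriptive column name than sub is wanted, the grouping expression can be given a name inside by. The following is a minimal sketch of that variation, not part of the original answer; the toy DT is shortened to two levels per taxpath for brevity, and the name phylum is my choice.

```r
library(data.table)

# Toy data mirroring the question (taxpaths shortened to two levels)
DT <- data.table(
  taxpath = c("Bacteroidetes; Flavobacteriia",
              "Bacteroidetes; Flavobacteriia",
              "Proteobacteria; Alphaproteobacteria",
              "Proteobacteria; Alphaproteobacteria",
              "Planctomycetes; NA"),
  N = c(57L, 54L, 53L, 41L, 41L)
)

# Naming the grouping expression gives the result column that name
res <- DT[, .(Nnew = sum(N)), by = .(phylum = sub(";.*", "", taxpath))]
print(res)
```

Groups are returned in order of first appearance, so the row order follows the input data.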
Data
DT <- fread("taxpath\t N
Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;\t 57
Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;\t 54
Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;\t 53
Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;\t 41
Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;\t 41")
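For the follow-up mentioned in the question (working with the other taxonomic levels), one option is to split taxpath into one column per rank with tstrsplit and then group by whichever rank is needed. This is a sketch under assumptions: the rank column names are hypothetical, every taxpath is assumed to have exactly six fields, and the data is rebuilt inline so the block runs on its own.

```r
library(data.table)

DT <- data.table(
  taxpath = c(
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;",
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;",
    "Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;"
  ),
  N = c(57L, 54L, 53L, 41L, 41L)
)

# Hypothetical column names, one per taxonomic rank
ranks <- c("phylum", "class", "order", "family", "genus", "species")

# Split on ";" followed by optional whitespace; the trailing ";" yields no extra field
DT[, (ranks) := tstrsplit(taxpath, ";\\s*")]

# Sum N at any rank, here by class
by_class <- DT[, .(Nnew = sum(N)), by = "class"]
print(by_class)
```

Note that missing ranks stay as the literal string "NA" after the split, so they form a group of their own unless they are converted to real NA values first.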
Answer 1 (score: 1)
We can use separate to split 'taxpath' into the specified columns at the delimiter ;, group by 'phylum', and get the sum of 'N'.
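The code for this answer is not included here, so the following is only a sketch of what the described approach might look like, assuming the tidyr and dplyr packages. The rank column names and the inline data frame are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  taxpath = c(
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48",
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA",
    "Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84"
  ),
  N = c(57, 54, 53, 41, 41),
  stringsAsFactors = FALSE
)

# Hypothetical column names, one per taxonomic rank
ranks <- c("phylum", "class", "order", "family", "genus", "species")

res <- df %>%
  separate(taxpath, into = ranks, sep = ";\\s*") %>%  # one column per rank
  group_by(phylum) %>%                                 # group on the first rank
  summarise(Nnew = sum(N))                             # total N per phylum
print(res)
```

Grouping by a different rank only requires changing the column passed to group_by.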