I have this data.frame:
Data
df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
major_rock = c("greywacke", "mudstone", "gravel", "greywacke", "gravel", "mudstone", "gravel", "mudstone", "mudstone",
"conglomerate", "gravel", "mudstone", "greywacke","conglomerate", "gravel", "gravel", "greywacke","gravel",
"greywacke", "gravel", "mudstone", "greywacke", "gravel", "gravel", "gravel", "conglomerate", "greywacke",
"coquina", "gravel", "gravel", "greywacke", "gravel", "mudstone","mudstone", "gravel"),
minor_rock = c("sandstone mudstone basalt chert limestone", "limestone", "sand silt clay", "sandstone mudstone basalt chert limestone",
"sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone", "mudstone sandstone coquina limestone",
"sandstone mudstone limestone", "sand loess silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
"sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
"sand silt clay", "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
"sand loess silt", "sand silt clay", "loess silt sand", "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
"limestone", "loess sand silt", NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
"mudstone sandstone coquina limestone", "mudstone tephra loess"),
area_ha = c(1066.68, 7.59, 3.41, 4434.76, 393.16, 361.69, 306.75, 124.93, 95.84, 9.3, 8.45, 4565.89, 2600.44, 2198.52,
2131.71, 2050.09, 1640.47, 657.09, 296.73, 178.12, 10403.53, 8389.2, 8304.08, 3853.36, 2476.36, 2451.25,
1640.47, 1023.02, 532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))
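There are 4 sites: 2 are independent (site1 and site3; they do not contain any upstream sites) and 2 are dependent (site2 and site4; they include upstream sites).
I want to create a new data.frame, let's call it df_indep, in which all sites are independent. This means subtracting any upstream sites from the dependent sites, as follows:
site1 and site3 stay the same, because they are already independent
site2 (independent) = site2 - site1
site4 (independent) = site4 - (site2 + site3)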
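To make the arithmetic concrete, here is a minimal sketch for a single rock combination (greywacke with minor rocks "sandstone mudstone basalt chert limestone"), using the area_ha values from df above. The variable names are illustrative only, and it assumes the subtraction is done separately for each major_rock/minor_rock combination.

```r
# area_ha for greywacke / "sandstone mudstone basalt chert limestone", copied from df
site1 <- 1066.68
site2 <- 4434.76   # cumulative: includes site1
site3 <- 2600.44
site4 <- 8389.20   # cumulative: includes site2 and site3

# Independent areas, following the rules listed above
site2_indep <- site2 - site1
site4_indep <- site4 - (site2 + site3)

print(site2_indep)
print(site4_indep)
```

Because site2 is itself cumulative, subtracting it from site4 also removes site1's contribution, so site1 does not need to be subtracted from site4 separately.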
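The table below covers only the major_rock and minor_rock combinations whose area_percent is greater than 5%, computed before subtracting the upstream sites (so the dependent sites are still cumulative). The final result I want, after subtracting the upstream sites, is the second table further down.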
library(dplyr)
head(df %>% group_by(id) %>%
mutate(area_percent = area_ha/sum(area_ha)*100) %>%
filter(area_percent>5),15)
# id major_rock minor_rock area_ha area_percent
# <fctr> <fctr> <fctr> <dbl> <dbl>
#1 site1 greywacke sandstone mudstone basalt chert limestone 1066.68 98.979289
#2 site2 greywacke sandstone mudstone basalt chert limestone 4434.76 77.329604
#3 site2 gravel sand silt clay 393.16 6.855592
#4 site2 mudstone sandstone conglomerate coquina tephra 361.69 6.306845
#5 site2 gravel NA 306.75 5.348848
#6 site3 mudstone sandstone conglomerate coquina tephra 4565.89 27.978879
#7 site3 greywacke sandstone mudstone basalt chert limestone 2600.44 15.934986
#8 site3 conglomerate sandstone mudstone limestone 2198.52 13.472099
#9 site3 gravel sand loess silt 2131.71 13.062701
#10 site3 gravel loess silt sand 2050.09 12.562550
#11 site3 greywacke sandstone mudstone conglomerate chert limestone basalt 1640.47 10.052479
#12 site4 mudstone sandstone conglomerate coquina tephra 10403.53 25.925869
#13 site4 greywacke sandstone mudstone basalt chert limestone 8389.20 20.906106
#14 site4 gravel sand loess silt 8304.08 20.693984
#15 site4 gravel sand silt clay 3853.36 9.602674
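I would greatly appreciate any suggestions on how to do this in R.
UPDATE
Here is a map showing all 4 sites.
This is the final result I want after subtracting the upstream sites:
# id major_rock minor_rock area_ha area_percent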
#1 site1 greywacke sandstone mudstone basalt chert limestone 1066.68 98.979289
#2 site2 greywacke sandstone mudstone basalt chert limestone 3368.08 72.319849
#3 site2 gravel sand silt clay 389.75 8.368762
#4 site2 mudstone sandstone conglomerate coquina tephra 361.69 7.766254
#5 site2 gravel NA 306.75 6.586576
#6 site3 mudstone sandstone conglomerate coquina tephra 4565.89 27.978879
#7 site3 greywacke sandstone mudstone basalt chert limestone 2600.44 15.934986
#8 site3 conglomerate sandstone mudstone limestone 2198.52 13.472099
#9 site3 gravel sand loess silt 2131.71 13.062701
#10 site3 gravel loess silt sand 2050.09 12.562550
#11 site3 greywacke sandstone mudstone conglomerate chert limestone basalt 1640.47 10.052479
#12 site4 mudstone sandstone conglomerate coquina tephra 5475.95 30.297305
#13 site4 greywacke sandstone mudstone basalt chert limestone 1354.00 7.491403
#14 site4 gravel sand loess silt 6163.92 34.103701
#15 site4 gravel sand silt clay 2803.11 15.509031
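[Figures: each dependent site shown both cumulative (as in df) and independent, after subtracting its upstream sites; the independent version is what I want in the final output.]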
Regarding @rbierman's question about how the site dependencies are coded, please see the following:
site2
Answer 0 (score: 1)
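This isn't too bad; it just takes a little renaming and joining.
First, we want the dependencies in a tidy two-column format. You can use reshape2::melt or tidyr::gather on the wide dependency table you posted to make it long: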
deps = data.frame(
id = c("site2", "site4", "site4"),
dependency = c("site1", "site2", "site3"),
stringsAsFactors = FALSE
)
# id dependency
# 1 site2 site1
# 2 site4 site2
# 3 site4 site3
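As a rough sketch of that reshaping step, the example below uses tidyr::pivot_longer (the successor to tidyr::gather, swapped in here) on a made-up wide table. The layout of deps_wide and its column names upstream_1 and upstream_2 are assumptions for illustration, not the format actually posted in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide layout: one row per site, one column per upstream slot
deps_wide <- data.frame(
  id         = c("site1", "site2", "site3", "site4"),
  upstream_1 = c(NA, "site1", NA, "site2"),
  upstream_2 = c(NA, NA, NA, "site3"),
  stringsAsFactors = FALSE
)

# One row per (id, dependency) pair; sites without upstream sites drop out
deps <- deps_wide %>%
  pivot_longer(
    cols = starts_with("upstream"),
    names_to = "slot",
    values_to = "dependency",
    values_drop_na = TRUE
  ) %>%
  select(id, dependency)

print(deps)
```

If the real table is laid out differently, only the cols selection should need to change.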
We also want character rather than factor columns when joining with dplyr, in case the factor levels are not the same in both data frames.
library(dplyr)
# Convert the join keys to character (across() requires dplyr >= 1.0.0)
df = mutate(df, across(c(id, major_rock, minor_rock), as.character))
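Next, we build a "dependency with measure" data frame, in which the area and ID columns (edited) carry explicit dependency names. We then aggregate it to the id level, summing the dependent areas: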
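dep_w_measure = df %>%
  # Rename so that each row describes the area of an upstream (dependency) site
  select(dependency = id, major_rock, minor_rock, dep_area = area_ha) %>%
  # Attach the downstream id that each upstream site contributes to
  inner_join(deps, by = "dependency") %>%
  group_by(id, major_rock, minor_rock) %>%
  # Total upstream area per downstream site and rock combination
  summarize(dep_area = sum(dep_area))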
Then we join this back to the original data and subtract the dependent area (where there is one):
result = left_join(df, dep_w_measure, by = c("major_rock", "minor_rock", "id")) %>%
mutate(area_ind = area_ha - coalesce(dep_area, 0))
head(result)
# id major_rock minor_rock area_ha dep_area area_ind
# 1 site1 greywacke sandstone mudstone basalt chert limestone 1066.68 NA 1066.68
# 2 site1 mudstone limestone 7.59 NA 7.59
# 3 site1 gravel sand silt clay 3.41 NA 3.41
# 4 site2 greywacke sandstone mudstone basalt chert limestone 4434.76 1066.68 3368.08
# 5 site2 gravel sand silt clay 393.16 3.41 389.75
# 6 site2 mudstone sandstone conglomerate coquina tephra 361.69 NA 361.69
I left the dep_area and area_ha columns in to "show my work"; you can clean them up as needed. The independent area column, area_ind, matches area_ha in the desired output.
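To get from here to the df_indep described in the question, the percentages would still need to be recomputed on the independent areas. The following is a minimal sketch, not part of the original answer. To stay self-contained it uses result_toy, a hypothetical stand-in holding only the six rows printed by head(result) above, so the site2 percentages it produces differ from those of the full data.

```r
library(dplyr)

# Toy stand-in for `result`: the six rows shown by head(result)
result_toy <- data.frame(
  id = c("site1", "site1", "site1", "site2", "site2", "site2"),
  major_rock = c("greywacke", "mudstone", "gravel",
                 "greywacke", "gravel", "mudstone"),
  minor_rock = c("sandstone mudstone basalt chert limestone", "limestone",
                 "sand silt clay", "sandstone mudstone basalt chert limestone",
                 "sand silt clay", "sandstone conglomerate coquina tephra"),
  area_ind = c(1066.68, 7.59, 3.41, 3368.08, 389.75, 361.69),
  stringsAsFactors = FALSE
)

df_indep <- result_toy %>%
  group_by(id) %>%
  # Percentage of each site's independent area
  mutate(area_percent = area_ind / sum(area_ind) * 100) %>%
  ungroup() %>%
  # Same 5% threshold as the filter used in the question
  filter(area_percent > 5) %>%
  # Report the independent area under the original column name
  select(id, major_rock, minor_rock, area_ha = area_ind, area_percent)

print(df_indep)
```

With the real data, replacing result_toy with result should be enough, assuming result was built as shown above.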