Question

I have this data.frame

Data

df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
                 major_rock = c("greywacke",    "mudstone", "gravel",   "greywacke",    "gravel",   "mudstone", "gravel", "mudstone", "mudstone",   
                                "conglomerate", "gravel", "mudstone",   "greywacke","conglomerate", "gravel",   "gravel",   "greywacke","gravel",   
                                "greywacke",    "gravel",   "mudstone", "greywacke",    "gravel", "gravel", "gravel",   "conglomerate", "greywacke",
                                "coquina",  "gravel",   "gravel",   "greywacke",    "gravel",   "mudstone","mudstone",  "gravel"),
                 minor_rock = c("sandstone mudstone basalt chert limestone",  "limestone",   "sand silt clay", "sandstone mudstone basalt chert limestone",
                                "sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone",  "mudstone sandstone coquina limestone",
                                "sandstone mudstone limestone",  "sand loess silt",  "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
                                "sand silt clay",  "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sand loess silt", "sand silt clay", "loess silt sand",  "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
                                "limestone", "loess sand silt",  NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
                                "mudstone sandstone coquina limestone", "mudstone tephra loess"),
                 area_ha = c(1066.68,   7.59,   3.41,   4434.76,    393.16, 361.69, 306.75, 124.93, 95.84,  9.3,    8.45,   4565.89,    2600.44,    2198.52,    
                             2131.71,   2050.09,    1640.47,    657.09, 296.73, 178.12, 10403.53,   8389.2,  8304.08,   3853.36,    2476.36,    2451.25,    
                             1640.47,   1023.02,    532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))

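There are 4 sites in it: 2 are independent (site1 and site3; they do not include any upstream sites) and 2 are dependent (site2 and site4; they include upstream sites).

I want to create a new data.frame, call it df_indep, in which all sites are independent. That means subtracting any upstream sites from the dependent sites, as follows:

site1 and site3 stay the same, because they are already independent

site2 (independent) = site2 - site1

site4 (independent) = site4 - (site2 + site3)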

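The final result I want, after subtracting the upstream sites, is given in the update further down. For reference, the df below shows only the major_rock and minor_rock combinations with area_percent greater than 5%, before the upstream sites are subtracted from site2 and site4: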

library(dplyr)
head(df %>% group_by(id) %>% 
       mutate(area_percent = area_ha/sum(area_ha)*100) %>% 
       filter(area_percent>5),15)


#       id   major_rock                                             minor_rock  area_ha area_percent
#   <fctr>       <fctr>                                                 <fctr>    <dbl>        <dbl>
#1   site1    greywacke              sandstone mudstone basalt chert limestone  1066.68    98.979289
#2   site2    greywacke              sandstone mudstone basalt chert limestone  4434.76    77.329604
#3   site2       gravel                                         sand silt clay   393.16     6.855592
#4   site2     mudstone                  sandstone conglomerate coquina tephra   361.69     6.306845
#5   site2       gravel                                                     NA   306.75     5.348848
#6   site3     mudstone                  sandstone conglomerate coquina tephra  4565.89    27.978879
#7   site3    greywacke              sandstone mudstone basalt chert limestone  2600.44    15.934986
#8   site3 conglomerate                           sandstone mudstone limestone  2198.52    13.472099
#9   site3       gravel                                        sand loess silt  2131.71    13.062701
#10  site3       gravel                                        loess silt sand  2050.09    12.562550
#11  site3    greywacke sandstone mudstone conglomerate chert limestone basalt  1640.47    10.052479
#12  site4     mudstone                  sandstone conglomerate coquina tephra 10403.53    25.925869
#13  site4    greywacke              sandstone mudstone basalt chert limestone  8389.20    20.906106
#14  site4       gravel                                        sand loess silt  8304.08    20.693984
#15  site4       gravel                                         sand silt clay  3853.36     9.602674

I would greatly appreciate any suggestions on how to do this in R.

Update

[Figure: map showing all 4 sites]

This is the final output I want after subtracting the upstream sites:

#       id   major_rock                                             minor_rock  area_ha area_percent
#1   site1    greywacke              sandstone mudstone basalt chert limestone  1066.68    98.979289
#2   site2    greywacke              sandstone mudstone basalt chert limestone  3368.08    72.319849
#3   site2       gravel                                         sand silt clay   389.75     8.368762
#4   site2     mudstone                  sandstone conglomerate coquina tephra   361.69     7.766254
#5   site2       gravel                                                     NA   306.75     6.586576
#6   site3     mudstone                  sandstone conglomerate coquina tephra  4565.89    27.978879
#7   site3    greywacke              sandstone mudstone basalt chert limestone  2600.44    15.934986
#8   site3 conglomerate                           sandstone mudstone limestone  2198.52    13.472099
#9   site3       gravel                                        sand loess silt  2131.71    13.062701
#10  site3       gravel                                        loess silt sand  2050.09    12.562550
#11  site3    greywacke sandstone mudstone conglomerate chert limestone basalt  1640.47    10.052479
#12  site4     mudstone                  sandstone conglomerate coquina tephra  5475.95    30.297305
#13  site4    greywacke              sandstone mudstone basalt chert limestone  1354.00     7.491403
#14  site4       gravel                                        sand loess silt  6163.92    34.103701
#15  site4       gravel                                         sand silt clay  2803.11    15.509031

[Figure: site4 (cumulative, as in df) and site1 (independent), as wanted in the final output]

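[Figure: the same for site2, cumulative and independent]

Regarding @rbierman's question about how the site dependencies are coded: site2 includes site1 upstream, and site4 includes site2 and site3 upstream.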

Answer 1

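This isn't too bad; it just takes a little renaming and joining.

First, we want the dependencies in a nice two-column format. You can use reshape2::melt or tidyr::gather on the wide dependencies you posted to make them long: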

deps = data.frame(
    id = c("site2", "site4", "site4"),
    dependency = c("site1", "site2", "site3"),
    stringsAsFactors = FALSE
)
#      id dependency
# 1 site2      site1
# 2 site4      site2
# 3 site4      site3
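As a rough sketch of that reshaping step: the wide table from the question is not reproduced here, so the layout below (`deps_wide`, with columns `upstream_1` and `upstream_2`) is an assumption made for illustration. It uses `tidyr::pivot_longer`, the newer counterpart of `gather`.

```r
library(tidyr)

# Hypothetical wide layout: one row per dependent site,
# one column per upstream "slot" (NA where a slot is unused).
deps_wide <- data.frame(
    id = c("site2", "site4"),
    upstream_1 = c("site1", "site2"),
    upstream_2 = c(NA, "site3"),
    stringsAsFactors = FALSE
)

# Stack the upstream columns into a single "dependency" column,
# dropping the empty slots.
deps_long <- pivot_longer(
    deps_wide,
    cols = c(upstream_1, upstream_2),
    names_to = "slot",
    values_to = "dependency",
    values_drop_na = TRUE
)

# Keep only the two columns used for the joins.
deps_long <- as.data.frame(deps_long[, c("id", "dependency")])
```

If your wide table is shaped differently, only the `cols` selection should need to change.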

For joining with dplyr, we also want character rather than factor columns, in case the factor levels differ between the data frames.

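library(dplyr)
# convert the join keys to character (arguments passed by position so this
# works across dplyr versions; funs() is deprecated in newer releases)
df = mutate_at(df, c("id", "major_rock", "minor_rock"), as.character)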
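In dplyr 1.0 and later, the same conversion can also be written with `across()`. A minimal sketch on a made-up two-row data frame (`toy` is not part of the question's data):

```r
library(dplyr)

# Made-up example frame with factor key columns
toy <- data.frame(
    id = factor(c("site1", "site2")),
    major_rock = factor(c("gravel", "mudstone")),
    area_ha = c(3.41, 361.69)
)

# Convert every factor column to character, leaving numeric columns alone
toy <- mutate(toy, across(where(is.factor), as.character))
```

Selecting with `where(is.factor)` avoids listing the columns by name, which may or may not be what you want if some factors should stay factors.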

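First we create a "dependency with measure" data frame, with explicitly dependency-named columns for the area and the ID, then we summarize it to the id level, summing the dependent areas: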

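dep_w_measure = df %>%
    # relabel each site's rows as the area it contributes when it is a dependency
    select(dependency = id, major_rock, minor_rock, dep_area = area_ha) %>%
    # attach the downstream id that each dependency feeds into
    inner_join(deps, by = "dependency") %>%
    # total the upstream area per downstream site and rock combination
    group_by(id, major_rock, minor_rock) %>%
    summarize(dep_area = sum(dep_area))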

Then we join that to the original data and subtract the dependent area, where there is one:

result = left_join(df, dep_w_measure, by = c("major_rock", "minor_rock", "id")) %>%
    mutate(area_ind = area_ha - coalesce(dep_area, 0))
head(result)
#      id major_rock                                minor_rock area_ha dep_area area_ind
# 1 site1  greywacke sandstone mudstone basalt chert limestone 1066.68       NA  1066.68
# 2 site1   mudstone                                 limestone    7.59       NA     7.59
# 3 site1     gravel                            sand silt clay    3.41       NA     3.41
# 4 site2  greywacke sandstone mudstone basalt chert limestone 4434.76  1066.68  3368.08
# 5 site2     gravel                            sand silt clay  393.16     3.41   389.75
# 6 site2   mudstone     sandstone conglomerate coquina tephra  361.69       NA   361.69

I left the dep_area and area_ha columns in to "show my work"; you can clean them up as needed. The independent area column, area_ind, matches the area_ha in your desired output.
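To get from there to something like df_indep, the steps above can be wrapped in a function that also recomputes area_percent per site. This is only a sketch: `make_independent` is a hypothetical helper name, and the `toy` data is a cut-down version of site1 and site2 with shortened minor_rock labels, not the full data from the question.

```r
library(dplyr)

# Toy subset (site1 and site2 only), minor_rock labels shortened for brevity
toy <- data.frame(
    id = c("site1", "site1", "site2", "site2", "site2"),
    major_rock = c("greywacke", "gravel", "greywacke", "gravel", "mudstone"),
    minor_rock = c("basalt", "clay", "basalt", "clay", "tephra"),
    area_ha = c(1066.68, 3.41, 4434.76, 393.16, 361.69),
    stringsAsFactors = FALSE
)
toy_deps <- data.frame(id = "site2", dependency = "site1",
                       stringsAsFactors = FALSE)

# Hypothetical helper: subtract upstream areas, then recompute percentages
make_independent <- function(df, deps) {
    # upstream area per downstream site and rock combination
    dep_area <- df %>%
        select(dependency = id, major_rock, minor_rock, dep_area = area_ha) %>%
        inner_join(deps, by = "dependency") %>%
        group_by(id, major_rock, minor_rock) %>%
        summarize(dep_area = sum(dep_area)) %>%
        ungroup()

    df %>%
        left_join(dep_area, by = c("id", "major_rock", "minor_rock")) %>%
        # rows without an upstream match keep their original area
        mutate(area_ha = area_ha - coalesce(dep_area, 0)) %>%
        select(-dep_area) %>%
        group_by(id) %>%
        mutate(area_percent = area_ha / sum(area_ha) * 100) %>%
        ungroup()
}

df_indep <- make_independent(toy, toy_deps)
```

Applying `filter(df_indep, area_percent > 5)` afterwards would mirror the filter used in the question. Note that deps should list only the immediately upstream sites: the cumulative site2 area already contains site1, so also listing site1 as a dependency of site4 would subtract it twice.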

Subtract rows based on many columns
