R: Using dplyr to create a data frame with a percentage for each group

Time: 2018-05-30 16:49:24

Tags: r dplyr tidyverse

I'm having trouble using dplyr to create a new data frame that holds one percentage per year.

The data frame looks like this:

    structure(list(orgid = c("USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", 
"USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", 
"USGS-PA", "USGS-PA", "USGS-NJ", "USGS-NJ", "USGS-NJ"), stdate = structure(c(16134, 
16133, 16135, 16133, 16105, 15749, 16112, 16394, 16610, 16610, 
16511, 16560, 16566, 16328, 16324), class = "Date"), locid = c("USGS-01367785", 
"USGS-01455099", "USGS-01440000", "USGS-01380100", "USGS-01398000", 
"USGS-01461880", "USGS-0140940950", "USGS-01482500", "USGS-0146453250", 
"USGS-0146453250", "USGS-01444800", "USGS-01444800", "USGS-01477120", 
"USGS-01392150", "USGS-01376274"), charnam = c("Total dissolved solids", 
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids", 
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids", 
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids", 
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids", 
"Total dissolved solids", "Total dissolved solids"), val = c("154", 
"333", "109", "143", "711", "218", "104", "157", "506", "471", 
"3040", "1110", "142", "429", "266")), .Names = c("orgid", "stdate", 
"locid", "charnam", "val"), row.names = c(NA, 15L), class = "data.frame")

I want to create a new column that gives, for each year, the percentage of total dissolved solids values that are > 500.

My code so far:

if (!require(pacman)) {
  install.packages('pacman')

}

pacman::p_load("ggplot2","tidyr","plyr","dplyr")
#### Read in the necessary data ######
roadsalt_data<-read.table("QADportaldata_1988-2015.tsv",header=T,sep="\t",fill=T,stringsAsFactors = F)
#Convert date column from a character class to a date class so ggplot can  display as a continuous variable ###
roadsalt_data$stdate <- as.Date(roadsalt_data$stdate)
## Filter dataset to only contain columns I need ########
filtered_roadsalt <- roadsalt_data %>% 
  select(orgid, stdate,locid, charnam,val) %>%
  filter(between(stdate, as.Date("1996-01-01"), as.Date("2015-07-01"))) %>%
  filter(charnam == "Total dissolved solids" & as.numeric(as.character(val)) > 50.00)
##create a dataframe for percent of TDS >500
percent_data<-filtered_roadsalt %>%
  mutate(year=as.Date(cut(stdate, breaks = "year"))) %>%
  group_by(year) %>%
  mutate(prop = round(as.numeric(as.character(val))/sum(as.numeric(as.character(val)))*100, 2))

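However, this doesn't give me the result I want. The data frame I'm after should have 19 observations and 2 variables: one observation for each year from 1997 to 2015, together with its percentage. Any help would be greatly appreciated! Thanks!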

2 answers:

Answer 0: (score: 2)

install.packages("scales")

scales::percent(2.842215e-03)

0.284%

For other strategies, see also options(digits=) and options(scipen=).
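As a rough illustration of those other strategies, the sketch below formats the same number with base R only. The variable names `x` and `old` are made up for this example, and the choice of three decimal places is an assumption made to match the output shown above.

```r
x <- 2.842215e-03

# sprintf() with a fixed number of decimals; "%%" prints a literal percent sign
sprintf("%.3f%%", x * 100)
#> [1] "0.284%"

# format() with scientific = FALSE avoids exponent notation for a single value
format(x, scientific = FALSE)
#> [1] "0.002842215"

# options(scipen = ) shifts the fixed-vs-scientific preference globally;
# keep the previous settings so they can be restored afterwards
old <- options(scipen = 999)
print(2.842215e-05)
options(old)
```

Changing `options()` affects every later print in the session, so a per-value call such as `sprintf()` or `format()` is usually the less intrusive choice.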

Answer 1: (score: 1)

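First, you need to convert val to numeric and extract the year from each date, which can be done with lubridate::year. count is shorthand for grouping by variables and summarising when the only summary statistic you need is the number of observations. It may not be the case in your full dataset, but in the sample you posted, 2013 has no observations with a value greater than 500, so the summarised data would have no (TRUE, 2013) row. I therefore use complete to fill in a row that explicitly shows the 0 observations there.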
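To make the count shorthand and the effect of complete concrete, here is a small sketch on made-up data. The `toy` data frame and the names `short`, `long`, and `filled` are hypothetical and are not part of the question or the answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not taken from the question
toy <- tibble(
  year = c(2013, 2014, 2014, 2015),
  val  = c(100, 200, 700, 900)
)

# count() with a named expression groups by it and counts the rows per group
short <- toy %>% count(year, over = val > 500)

# The longer equivalent: create the flag, group, then summarise with n()
long <- toy %>%
  mutate(over = val > 500) %>%
  group_by(year, over) %>%
  summarise(n = n()) %>%
  ungroup()

# complete() adds the year/over combinations that never occurred, with n = 0
filled <- short %>% complete(year, over, fill = list(n = 0L))
```

Here `short` and `long` hold the same four rows, while `filled` has six: one for every combination of the three years and the two values of `over`.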

library(tidyverse)

shares <- df %>%
  as_tibble() %>%
  mutate(val = as.numeric(val)) %>%
  mutate(year = lubridate::year(stdate)) %>%
  count(year, charnam, isOver500 = val > 500) %>%
  complete(isOver500, nesting(year, charnam), fill = list(n = 0)) %>%
  mutate(share = n / sum(n))

shares
#> # A tibble: 6 x 5
#>   isOver500  year charnam                    n  share
#>   <lgl>     <dbl> <chr>                  <dbl>  <dbl>
#> 1 FALSE      2013 Total dissolved solids     1 0.0667
#> 2 FALSE      2014 Total dissolved solids     8 0.533 
#> 3 FALSE      2015 Total dissolved solids     2 0.133 
#> 4 TRUE       2013 Total dissolved solids     0 0     
#> 5 TRUE       2014 Total dissolved solids     1 0.0667
#> 6 TRUE       2015 Total dissolved solids     3 0.2

shares %>%
  filter(isOver500)
#> # A tibble: 3 x 5
#>   isOver500  year charnam                    n  share
#>   <lgl>     <dbl> <chr>                  <dbl>  <dbl>
#> 1 TRUE       2013 Total dissolved solids     0 0     
#> 2 TRUE       2014 Total dissolved solids     1 0.0667
#> 3 TRUE       2015 Total dissolved solids     3 0.2
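
Note that the share column above divides by the total number of rows across all years (for example 1/15 = 0.0667), because the result of count and complete is not grouped. If the goal is the percentage within each year, as the question describes, one possible sketch is shown below. It is not the answerer's code: `df_small` is a reduced copy of the posted sample that keeps only the date and value columns, and the names `pct_by_year` and `pct_over_500` are made up for this example.

```r
library(dplyr)

# Reduced copy of the sample data from the question (stdate and val only)
df_small <- data.frame(
  stdate = as.Date(c(16134, 16133, 16135, 16133, 16105, 15749, 16112, 16394,
                     16610, 16610, 16511, 16560, 16566, 16328, 16324),
                   origin = "1970-01-01"),
  val = c("154", "333", "109", "143", "711", "218", "104", "157", "506",
          "471", "3040", "1110", "142", "429", "266"),
  stringsAsFactors = FALSE
)

pct_by_year <- df_small %>%
  mutate(
    val  = as.numeric(val),
    # base R alternative to lubridate::year()
    year = as.integer(format(stdate, "%Y"))
  ) %>%
  group_by(year) %>%
  # mean() of a logical vector is the fraction of TRUE values in the group
  summarise(pct_over_500 = round(mean(val > 500) * 100, 2))

pct_by_year
```

For the posted sample this gives 0 for 2013 (0 of 1), 11.11 for 2014 (1 of 9), and 60 for 2015 (3 of 5): one row per year and two columns, which is the shape the question asks for.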

Created on 2018-05-30 by the reprex package (v0.2.0).