I am having trouble using dplyr to create a new data frame that holds one percentage per year.
The data frame looks like this:
structure(list(orgid = c("USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ",
"USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ", "USGS-NJ",
"USGS-PA", "USGS-PA", "USGS-NJ", "USGS-NJ", "USGS-NJ"), stdate = structure(c(16134,
16133, 16135, 16133, 16105, 15749, 16112, 16394, 16610, 16610,
16511, 16560, 16566, 16328, 16324), class = "Date"), locid = c("USGS-01367785",
"USGS-01455099", "USGS-01440000", "USGS-01380100", "USGS-01398000",
"USGS-01461880", "USGS-0140940950", "USGS-01482500", "USGS-0146453250",
"USGS-0146453250", "USGS-01444800", "USGS-01444800", "USGS-01477120",
"USGS-01392150", "USGS-01376274"), charnam = c("Total dissolved solids",
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids",
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids",
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids",
"Total dissolved solids", "Total dissolved solids", "Total dissolved solids",
"Total dissolved solids", "Total dissolved solids"), val = c("154",
"333", "109", "143", "711", "218", "104", "157", "506", "471",
"3040", "1110", "142", "429", "266")), .Names = c("orgid", "stdate",
"locid", "charnam", "val"), row.names = c(NA, 15L), class = "data.frame")
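I would like to create a new column that gives, for each year, the percentage of total dissolved solids values that are > 500.
My code so far: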
if (!require(pacman)) {
  install.packages('pacman')
}
pacman::p_load("ggplot2", "tidyr", "plyr", "dplyr")
#### Read in the necessary data ######
roadsalt_data <- read.table("QADportaldata_1988-2015.tsv", header = TRUE, sep = "\t",
                            fill = TRUE, stringsAsFactors = FALSE)
# Convert the date column from character to Date so ggplot can display it as a continuous variable
roadsalt_data$stdate <- as.Date(roadsalt_data$stdate)
## Filter dataset to only contain the columns I need ########
filtered_roadsalt <- roadsalt_data %>%
  select(orgid, stdate, locid, charnam, val) %>%
  filter(between(stdate, as.Date("1996-01-01"), as.Date("2015-07-01"))) %>%
  filter(charnam == "Total dissolved solids" & as.numeric(as.character(val)) > 50.00)
## Create a data frame for the percent of TDS > 500
percent_data <- filtered_roadsalt %>%
  mutate(year = as.Date(cut(stdate, breaks = "year"))) %>%
  group_by(year) %>%
  mutate(prop = round(as.numeric(as.character(val)) / sum(as.numeric(as.character(val))) * 100, 2))
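However, this does not give me the result I want. The data frame I am after should have 19 observations of 2 variables: one row per year from 1997 to 2015, together with the percentage for that year. Any help would be greatly appreciated! Thanks!
Answer 0 (score: 2)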
install.packages("scales")
scales::percent(2.842215e-03)
0.284%
For other strategies, see also options(digits=) and options(scipen=).
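As a rough illustration of those alternatives, the sketch below formats the same number with base R only. The variable names x, pct, fixed and old are made up for the example, and three decimal places is an assumption about the precision you want.

```r
x <- 2.842215e-03  # the value used in the scales::percent() call above

# Percentage string with three decimal places, base R only
pct <- sprintf("%.3f%%", 100 * x)
pct

# Fixed (non-scientific) notation for a single value
fixed <- format(x, scientific = FALSE)
fixed

# Session-wide alternative: penalise scientific notation when printing,
# then restore the previous setting
old <- options(scipen = 999)
print(x / 100)
options(old)
```

sprintf() and format() only affect the string they return, whereas options(scipen=) changes how every number is printed for the rest of the session, which is why the sketch restores the old value afterwards.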
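Answer 1 (score: 1)
First, you need to convert val to numeric and extract the year from each date. The latter can be done with lubridate::year. count is shorthand for grouping by variables and summarising them when the only summary statistic you need is the number of observations. This may not be the case in your full dataset, but in the sample you posted, 2013 has no observations with a value greater than 500, so the summarised data would have no (TRUE, 2013) row. I therefore use complete to fill in a row that explicitly shows 0 observations for that combination.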
library(tidyverse)
shares <- df %>%
  as_tibble() %>%
  mutate(val = as.numeric(val)) %>%
  mutate(year = lubridate::year(stdate)) %>%
  count(year, charnam, isOver500 = val > 500) %>%
  complete(isOver500, nesting(year, charnam), fill = list(n = 0)) %>%
  mutate(share = n / sum(n))
shares
#> # A tibble: 6 x 5
#> isOver500 year charnam n share
#> <lgl> <dbl> <chr> <dbl> <dbl>
#> 1 FALSE 2013 Total dissolved solids 1 0.0667
#> 2 FALSE 2014 Total dissolved solids 8 0.533
#> 3 FALSE 2015 Total dissolved solids 2 0.133
#> 4 TRUE 2013 Total dissolved solids 0 0
#> 5 TRUE 2014 Total dissolved solids 1 0.0667
#> 6 TRUE 2015 Total dissolved solids 3 0.2
shares %>%
  filter(isOver500)
#> # A tibble: 3 x 5
#> isOver500 year charnam n share
#> <lgl> <dbl> <chr> <dbl> <dbl>
#> 1 TRUE 2013 Total dissolved solids 0 0
#> 2 TRUE 2014 Total dissolved solids 1 0.0667
#> 3 TRUE 2015 Total dissolved solids 3 0.2
Created on 2018-05-30 by the reprex package (v0.2.0).
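Note that the share column above is a fraction of all rows in the sample (the six values sum to 1), not a fraction within each year. If what you want is one row per year with the percentage of that year's observations above 500, a minimal sketch could look like the following. It rebuilds only the stdate and val columns from the sample in the question; the names percent_over_500 and pct_over_500 are hypothetical, and it uses format() instead of lubridate::year only to keep the dependencies down.

```r
library(dplyr)

# Sample rows from the question (only the two columns needed here)
df <- data.frame(
  stdate = as.Date(c(16134, 16133, 16135, 16133, 16105, 15749, 16112, 16394,
                     16610, 16610, 16511, 16560, 16566, 16328, 16324),
                   origin = "1970-01-01"),
  val = c("154", "333", "109", "143", "711", "218", "104", "157", "506",
          "471", "3040", "1110", "142", "429", "266"),
  stringsAsFactors = FALSE
)

percent_over_500 <- df %>%
  mutate(val = as.numeric(val),                      # val is stored as character
         year = as.integer(format(stdate, "%Y"))) %>%
  group_by(year) %>%
  # mean() of a logical vector is the fraction of TRUE values in the group
  summarise(pct_over_500 = round(100 * mean(val > 500), 2))

percent_over_500
```

Because mean(val > 500) is computed inside each year group, a year with no values above 500 still appears with 0, so complete() is not needed here. A year with no rows at all would still be missing from the result.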