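I want to average columns in an R data frame that contain integer values and occasional NAs.
The data frame is called CD6 (climate division 6). It is initialized with NA values and is meant to store the averages of all data belonging to climate division 6. Rows are dates, and columns represent the hours from 0 to 23. The data frame looks like this: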
> CD6
Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 ... H23
1948-07-01 NA NA NA NA NA NA NA NA NA NA NA ... NA
1948-07-02 NA NA NA NA NA NA NA NA NA NA NA ... NA
1948-07-03 NA NA NA NA NA NA NA NA NA NA NA ... NA
A data frame named CA holds the actual values for all climate divisions, 1 through 7. It looks like this:
> CA
Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 ... H23
6 1948-07-01 NA NA NA NA NA NA NA NA NA NA NA ... NA
5 1948-07-01 0 1 1 3 0 0 0 0 0 0 0 ... 2
6 1948-07-01 0 1 1 3 0 0 0 0 0 0 0 ... 2
6 1948-07-01 1 0 0 5 7 0 1 1 1 0 0 ... 0
6 1948-07-02 0 2 1 2 1 1 NA 0 1 0 1 ... 2
6 1948-07-03 NA NA NA NA NA NA NA NA NA NA NA ... NA
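I have written a for loop that iterates over CA row by row and maps each row to the data frame for its climate division (in this case CD6 for climate division 6). One problem is that I don't know how many rows each climate division has, so I can't compute the average correctly.
Looking only at CD6, I want the average for each date at each hour, ignoring NAs whenever real values exist, with the final answer as an integer (the ceiling of the value). If every value for a given hour is NA within a climate division, I want the result to stay NA rather than become 0. The final result for CD6 should look like this: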
> CD6
Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 ... H23
1948-07-01 1 1 1 4 4 0 1 1 1 0 0 ... 1
1948-07-02 0 2 1 2 1 1 NA 0 1 0 1 ... 2
1948-07-03 NA NA NA NA NA NA NA NA NA NA NA ... NA
I don't know exactly how to write this and make it efficient, so any suggestions would help. Thank you for your time.
Answer 0: (score: 2)
What you are looking for is an aggregation of CA grouped by two columns, Climate_Division and Date. You can use the built-in aggregate function to do this.
> t <- 'Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10
+ 6 1948-07-01 NA NA NA NA NA NA NA NA NA NA NA
+ 5 1948-07-01 0 1 1 3 0 0 0 0 0 0 0
+ 6 1948-07-01 0 1 1 3 0 0 0 0 0 0 0
+ 6 1948-07-01 1 0 0 5 7 0 1 1 1 0 0
+ 6 1948-07-02 0 2 1 2 1 1 NA 0 1 0 1
+ 6 1948-07-03 NA NA NA NA NA NA NA NA NA NA NA'
>
> CA <- read.table(textConnection(t), header=T)
>
> CA
Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10
1 6 1948-07-01 NA NA NA NA NA NA NA NA NA NA NA
2 5 1948-07-01 0 1 1 3 0 0 0 0 0 0 0
3 6 1948-07-01 0 1 1 3 0 0 0 0 0 0 0
4 6 1948-07-01 1 0 0 5 7 0 1 1 1 0 0
5 6 1948-07-02 0 2 1 2 1 1 NA 0 1 0 1
6 6 1948-07-03 NA NA NA NA NA NA NA NA NA NA NA
> # With the data loaded, group by climate division and date and take the mean of each hour column, ignoring NAs
> CAMeans <- aggregate(CA[,3:13], by =list(CA[,1], CA[,2]), FUN = mean, na.rm = TRUE)
>
> CAMeans
Group.1 Group.2 H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10
1 5 1948-07-01 0.0 1.0 1.0 3 0.0 0 0.0 0.0 0.0 0 0
2 6 1948-07-01 0.5 0.5 0.5 4 3.5 0 0.5 0.5 0.5 0 0
3 6 1948-07-02 0.0 2.0 1.0 2 1.0 1 NaN 0.0 1.0 0 1
4 6 1948-07-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>
> #Need to change the names of grouping column back to what they were before
> names(CAMeans)[1:2] <- c('Climate_Division', 'Date')
>
> CAMeans
Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10
1 5 1948-07-01 0.0 1.0 1.0 3 0.0 0 0.0 0.0 0.0 0 0
2 6 1948-07-01 0.5 0.5 0.5 4 3.5 0 0.5 0.5 0.5 0 0
3 6 1948-07-02 0.0 2.0 1.0 2 1.0 1 NaN 0.0 1.0 0 1
4 6 1948-07-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>
> #Now you can subset CAMeans to get content for CD6
> CD6 <- CAMeans[CAMeans$Climate_Division == 6, 2:ncol(CAMeans)]
>
> CD6
Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10
2 1948-07-01 0.5 0.5 0.5 4 3.5 0 0.5 0.5 0.5 0 0
3 1948-07-02 0.0 2.0 1.0 2 1.0 1 NaN 0.0 1.0 0 1
4 1948-07-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
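The question also asks for integer results rounded up, and for NA rather than NaN where every value in a group is missing. Below is a minimal sketch of that extra step, built on the same aggregate() approach. The helper name ceiling_mean and the variable hour_cols are hypothetical, introduced only for illustration, and the hour columns are assumed to be named H followed by digits.

```r
# Rebuild the sample data from the question (columns H0..H10 only).
CA <- read.table(header = TRUE, text = "Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10
6 1948-07-01 NA NA NA NA NA NA NA NA NA NA NA
5 1948-07-01 0 1 1 3 0 0 0 0 0 0 0
6 1948-07-01 0 1 1 3 0 0 0 0 0 0 0
6 1948-07-01 1 0 0 5 7 0 1 1 1 0 0
6 1948-07-02 0 2 1 2 1 1 NA 0 1 0 1
6 1948-07-03 NA NA NA NA NA NA NA NA NA NA NA")

# Hypothetical helper: mean of the non-NA values, rounded up to an integer.
# Returns NA (not NaN or 0) when every value in the group is missing.
ceiling_mean <- function(x) {
  if (all(is.na(x))) return(NA_integer_)
  as.integer(ceiling(mean(x, na.rm = TRUE)))
}

# Select the hour columns by name instead of by position (assumed naming).
hour_cols <- grep("^H[0-9]+$", names(CA), value = TRUE)

# Naming the grouping list keeps the original column names in the result.
CAMeans <- aggregate(CA[hour_cols],
                     by = list(Climate_Division = CA$Climate_Division,
                               Date = CA$Date),
                     FUN = ceiling_mean)

# Keep only climate division 6 and drop the division column.
CD6 <- CAMeans[CAMeans$Climate_Division == 6, c("Date", hour_cols)]
rownames(CD6) <- NULL
CD6
```

On the sample data, the 1948-07-01 row comes out as 1 1 1 4 4 0 1 1 1 0 0, which agrees with the expected result in the question, and 1948-07-03 stays NA throughout. Passing a named list to by also makes the later names(CAMeans)[1:2] <- ... step unnecessary.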
Answer 1: (score: 1)
I'm guessing at what you want, so here are two options: rowMeans() and colMeans().
CA <- read.table(
header=TRUE, text='Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H23
6 1948-07-01 NA NA NA NA NA NA NA NA NA NA NA NA
5 1948-07-01 0 1 1 3 0 0 0 0 0 0 0 2
6 1948-07-01 0 1 1 3 0 0 0 0 0 0 0 2
6 1948-07-01 1 0 0 5 7 0 1 1 1 0 0 0
6 1948-07-02 0 2 1 2 1 1 NA 0 1 0 1 2
6 1948-07-03 NA NA NA NA NA NA NA NA NA NA NA NA')
CD6 <- CA[CA$Climate_Division==6, ] # Populating your data does not require a loop.
(CD6rmeans <- rowMeans(CD6[, -2], na.rm=TRUE))
# 1 3 4 5 6
# 6.000 1.000 1.692 1.417 6.000
t(CD6cmeans <- colMeans(CD6[ ,-2], na.rm=TRUE))
# Climate_Division H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H23
# [1,] 6 0.3333 1 0.6667 3.333 2.667 0.3333 0.5 0.3333 0.6667 0 0.3333 1.333
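Note that CD6[, -2] drops only the Date column, so Climate_Division is averaged in with the hourly values. That is why the two all-NA rows report a mean of 6. A small sketch that restricts both means to the hour columns, assuming those columns are all named H followed by digits (hour_cols, row_means and col_means are illustrative names, not from the original answer):

```r
CA <- read.table(header = TRUE, text = "Climate_Division Date H0 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H23
6 1948-07-01 NA NA NA NA NA NA NA NA NA NA NA NA
5 1948-07-01 0 1 1 3 0 0 0 0 0 0 0 2
6 1948-07-01 0 1 1 3 0 0 0 0 0 0 0 2
6 1948-07-01 1 0 0 5 7 0 1 1 1 0 0 0
6 1948-07-02 0 2 1 2 1 1 NA 0 1 0 1 2
6 1948-07-03 NA NA NA NA NA NA NA NA NA NA NA NA")

CD6 <- CA[CA$Climate_Division == 6, ]

# Pick out only the hour columns, leaving out Climate_Division and Date.
hour_cols <- grep("^H[0-9]+$", names(CD6), value = TRUE)

# Mean across hours for each row, and mean across rows for each hour.
row_means <- rowMeans(CD6[hour_cols], na.rm = TRUE)
col_means <- colMeans(CD6[hour_cols], na.rm = TRUE)

# Rows with no data at all yield NaN under na.rm = TRUE; report them as NA.
row_means[is.nan(row_means)] <- NA

row_means
col_means
```

Neither of these groups by Date, so they answer a different question from the per-date, per-hour average asked about above.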