Create a column in a data.frame that counts the columns meeting a given condition

Time: 2014-08-11 15:39:40

Tags: r calculated-columns

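I have a data.frame df with columns col1, col2, ..., col25 and Threshold.

I want to create a new column A that records, for each row, how many of the columns col1 ... col25 have a value above the threshold.

I think I could do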

df$A <- (df[paste("col",1,sep="")] >= df["Threshold"]) + (df[paste("col",2,sep="")] >= df["Threshold"]) + ...

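But that is not very elegant, which makes me think there must be a better, more compact way.

(Note: I need to assemble the column names from strings; the real column names are PV1MATH, PV2MATH, PV1SCIE, etc.)

Edit: generating data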

colnames <- paste("PV", rep(1:2, 5), c("MATH", "SCIE", "ENGI", "PHYS", "ARTS"), sep="")
df <- as.data.frame(matrix(rnorm(200, 60, 20), ncol=10))
names(df) <- colnames
df$Threshold <- rpois(20, 50)

1 answer:

Answer 0: (score: 1)

I have generated some random data so that I can give an example:

> colnames <- paste("PV", rep(1:2, 5), c("MATH", "SCIE", "ENGI", "PHYS", "ARTS"), sep="")
> df <- as.data.frame(matrix(rnorm(200, 60, 20), ncol=10))
> names(df) <- colnames
> df$Threshold <- rpois(20, 50)
> head(df)
   PV1MATH   PV2SCIE  PV1ENGI  PV2PHYS   PV1ARTS   PV2MATH   PV1SCIE  PV2ENGI  PV1PHYS  PV2ARTS Threshold
1 65.38862  59.10253 36.58240 54.32805  9.181924  55.01604 73.377464 75.57304 60.93116 31.99255        49
2 46.58772  81.16455 70.60132 19.45667 93.797606  12.80517 47.920166 51.90083 41.72037 63.98710        50
3 67.02016  57.85148 64.67905 24.49892 48.827826  57.26432 53.117871 67.83863 57.56008 67.69975        41
4 61.36172 107.93095 70.78672 38.21072 75.752956  48.12871 40.698131 82.58197 60.66945 61.52466        51
5 19.54413  51.27288 52.15215 71.99829 64.433654 116.80112 47.297671 57.39038 97.73618 75.57284        50
6 68.37724  40.35299 74.26690 60.44868 60.037653  40.99726  6.843594 84.68163 65.08556 62.26077        45
> 
> df$Above.Threshold <- rowSums(df[, -grep("Threshold", names(df))] > df$Threshold)
> head(df)
   PV1MATH   PV2SCIE  PV1ENGI  PV2PHYS   PV1ARTS   PV2MATH   PV1SCIE  PV2ENGI  PV1PHYS  PV2ARTS Threshold Above.Threshold
1 65.38862  59.10253 36.58240 54.32805  9.181924  55.01604 73.377464 75.57304 60.93116 31.99255        49               7
2 46.58772  81.16455 70.60132 19.45667 93.797606  12.80517 47.920166 51.90083 41.72037 63.98710        50               5
3 67.02016  57.85148 64.67905 24.49892 48.827826  57.26432 53.117871 67.83863 57.56008 67.69975        41               9
4 61.36172 107.93095 70.78672 38.21072 75.752956  48.12871 40.698131 82.58197 60.66945 61.52466        51               7
5 19.54413  51.27288 52.15215 71.99829 64.433654 116.80112 47.297671 57.39038 97.73618 75.57284        50               8
6 68.37724  40.35299 74.26690 60.44868 60.037653  40.99726  6.843594 84.68163 65.08556 62.26077        45               7

In your case, you can simply use the one-liner

df$Above.Threshold <- rowSums(df[, -grep("Threshold", names(df))] > df$Threshold)

This assumes the data is a data.frame named df.
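The one-liner relies on how R compares a data.frame with a vector whose length equals the number of rows: each column is compared element by element with that vector, so row i of every column is checked against Threshold[i]. A minimal sketch on made-up data (the toy data frame and the name value_cols are assumptions chosen so the counts can be checked by hand):

```r
# Toy data (hypothetical values, small enough to verify by hand)
toy <- data.frame(
  PV1MATH   = c(10, 60, 30),
  PV2MATH   = c(55, 40, 30),
  PV1SCIE   = c(70, 20, 30),
  Threshold = c(50, 30, 30)
)

# Every column except Threshold takes part in the comparison
value_cols <- setdiff(names(toy), "Threshold")

# Logical matrix: TRUE where a value is strictly above its row's threshold;
# rowSums() then counts the TRUEs per row.
toy$Above.Threshold <- rowSums(toy[, value_cols] > toy$Threshold)

toy$Above.Threshold
# row 1 -> 2, row 2 -> 2, row 3 -> 0 (ties are not counted with >)
```

Note that the question's own attempt uses >= while this answer uses >; the two differ only for values exactly equal to the threshold, as row 3 shows.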

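Alternatively, if you want to choose which columns the above-threshold count is computed on, change the grep condition. For example, a pattern anchored on the prefix, such as "^PV", selects the columns whose names start with PV.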
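A minimal sketch of selecting by prefix, reusing the generated data from above. The seed and the names pv_cols, math_cols and Math.Above are illustrative assumptions, not part of the original answer:

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Same data generation as in the question
colnames <- paste("PV", rep(1:2, 5), c("MATH", "SCIE", "ENGI", "PHYS", "ARTS"), sep = "")
df <- as.data.frame(matrix(rnorm(200, 60, 20), ncol = 10))
names(df) <- colnames
df$Threshold <- rpois(20, 50)

# Keep only the columns whose names start with "PV"
pv_cols <- grep("^PV", names(df), value = TRUE)
df$Above.Threshold <- rowSums(df[, pv_cols, drop = FALSE] > df$Threshold)

# Or assemble specific column names from strings, here the MATH columns only
math_cols <- paste0("PV", 1:2, "MATH")
df$Math.Above <- rowSums(df[, math_cols, drop = FALSE] > df$Threshold)

head(df[, c("Threshold", "Above.Threshold", "Math.Above")])
```

Selecting by a positive pattern, rather than excluding Threshold, keeps the count from changing if further non-PV columns are added to the data.frame later; drop = FALSE keeps the selection a data.frame even when only one column matches.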