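I have a data frame of numeric scores for groups (y) measured against different factors (x). It looks similar to the table below.
BU     AUDIT  CORC  GOV   PPS   TMSC  TRAIN
Unit1  2.00   0.00  2.00  4.00  1.50  2.50
Unit2  3.00   1.40  3.20  1.00  1.50  3.00
Unit3  2.50   2.40  2.80  3.00  2.75  2.50
Unit4  3.00   3.20  1.60  4.00  1.00  3.00
Unit5  2.00   2.80  2.00  2.00  3.00  2.50
The table was created like this:
library(dplyr)
library(reshape2)  # assumption: dcast() comes from reshape2
df %>%
  group_by(BU, CC) %>%  # BU = 'unit', CC = 'Control_Category'
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(BU ~ CC, value.var = "avg") %>%
  print()
These numeric scores map to string values, as shown in the reference "table" below.
Control_Score > 3.499 ~ "Ineffective",
Control_Score > 2.499 & Control_Score <= 3.499 ~ "Marginally Effective",
Control_Score >= 1.500 & Control_Score <= 2.499 ~ "Generally Effective",
Control_Score > 0.000 & Control_Score <= 1.499 ~ "Highly Effective"
Can someone help me find a way to keep the data table but convert each numeric score to the string value from the reference table?
I tried a few apply functions to compare the values, and I also tried mutate with case_when, without success.
Ideally, the final table would keep the same layout, with each score replaced by its label.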
Answer 0 (score: 1)
You can use nested ifelse statements to convert the numeric values to strings.
BU <- c("Unit1", "Unit2", "Unit3", "Unit4", "Unit5")
Audit <- c(2,3,2.5,3,2)
CORC <- c(0,1.4,2.4,3.2,2.8)
GOV <- c(2,3.2,2.8,1.6,2)
df <- data.frame(BU, Audit, CORC, GOV)
df$BU <- as.character(df$BU)
df$Audit <- as.numeric(as.character(df$Audit))
df$CORC <- as.numeric(as.character(df$CORC))
df$GOV <- as.numeric(as.character(df$GOV))
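# Thresholds follow the reference table in the question; 1.5 itself counts
# as "Generally Effective", hence >= rather than >.
df[,-1] <- ifelse(df[,-1]>3.499, "Ineffective",
           ifelse(df[,-1]>2.499 & df[,-1]<=3.499, "Marginally Effective",
           ifelse(df[,-1]>=1.5 & df[,-1]<=2.499, "Generally Effective",
                  "Highly Effective")))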
> df
BU Audit CORC GOV
1 Unit1 Generally Effective Highly Effective Generally Effective
2 Unit2 Marginally Effective Highly Effective Marginally Effective
3 Unit3 Marginally Effective Generally Effective Marginally Effective
4 Unit4 Marginally Effective Marginally Effective Generally Effective
5 Unit5 Generally Effective Marginally Effective Generally Effective
If you want to collapse the whole table into comma-separated strings, you can add the following code:
df[2:(NROW(df)+1),] <- df[1:NROW(df),]
df[1,] <- colnames(df)
new_df <- apply( df, 1 , paste , collapse = "," )
Output:
> new_df
1
"BU,Audit,CORC,GOV"
2
"Unit1,Generally Effective,Highly Effective,Generally Effective"
3
"Unit2,Marginally Effective,Highly Effective,Marginally Effective"
4
"Unit3,Marginally Effective,Generally Effective,Marginally Effective"
5
"Unit4,Marginally Effective,Marginally Effective,Generally Effective"
6
"Unit5,Generally Effective,Marginally Effective,Generally Effective"
Answer 1 (score: 1)
You can use case_when from dplyr to do this.
df1 <- read.table(header = TRUE,
text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1 2.0 0.0 2.0 4 1.50 2.5
Unit2 3.0 1.4 3.2 1 1.50 3.0
Unit3 2.5 2.4 2.8 3 2.75 2.5
Unit4 3.0 3.2 1.6 4 1.00 3.0
Unit5 2.0 2.8 2.0 2 3.00 2.5
')
I put the case_when in a function.
library(dplyr)
# Map a numeric score to its effectiveness label
score_label <- function(score){
  lbl <- case_when(
    score < 1.5 ~ "Highly Effective",
    score >= 1.5 & score < 2.5 ~ "Generally Effective",
    score >= 2.5 & score < 3.5 ~ "Marginally Effective",
    score >= 3.5 ~ "Ineffective"
  )
  return(lbl)
}
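Then apply the function to the data frame (edited following AOSmith's comment to use mutate_at from dplyr instead of the "apply" function, which is easier to read and follow).
df_out <- df1 %>%
  mutate_at(c("AUDIT", "CORC", "GOV", "PPS", "TMSC", "TRAIN"), score_label)
df_out[,1:4]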
BU AUDIT CORC GOV
Unit1 Generally Effective Highly Effective Generally Effective
Unit2 Marginally Effective Highly Effective Marginally Effective
Unit3 Marginally Effective Generally Effective Marginally Effective
Unit4 Marginally Effective Marginally Effective Generally Effective
Unit5 Generally Effective Marginally Effective Generally Effective
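In dplyr 1.0.0 and later, mutate_at is superseded by across. The following is a minimal sketch of the same idea, assuming a recent dplyr is installed; the two-row df1 here is a cut-down stand-in for the data above, used only for illustration.

```r
library(dplyr)

# Same labelling function as in the answer above
score_label <- function(score) {
  case_when(
    score < 1.5 ~ "Highly Effective",
    score >= 1.5 & score < 2.5 ~ "Generally Effective",
    score >= 2.5 & score < 3.5 ~ "Marginally Effective",
    score >= 3.5 ~ "Ineffective"
  )
}

# Cut-down illustrative data (assumption: same column layout as the question)
df1 <- data.frame(BU = c("Unit1", "Unit2"),
                  AUDIT = c(2, 3),
                  TMSC = c(1.5, 1.5))

# Apply score_label to every column except BU
df_out <- df1 %>%
  mutate(across(-BU, score_label))
df_out
```

Selecting with -BU avoids listing every score column by name, which helps if the set of control categories changes.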
Answer 2 (score: 0)
cut is a good function for dividing numbers into intervals and giving each interval a descriptive name.
Control_Score <- c(-1, 0, 1.4, 1.5, 2.499, 2.5, 3.499, 3.5, 4)
cut(
Control_Score,
breaks = c(0, 1.5, 2.5, 3.5, Inf),
labels = c(
"Highly Effective",
"Generally Effective",
"Marginally Effective",
"Ineffective"
),
include.lowest = TRUE
)
# [1] <NA> Highly Effective Highly Effective
# [4] Highly Effective Generally Effective Generally Effective
# [7] Marginally Effective Marginally Effective Ineffective
# 4 Levels: Highly Effective Generally Effective ... Ineffective
As you can see with -1, any value outside the specified intervals is assigned NA, so invalid data is less likely to go unnoticed.
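Note that with the default right = TRUE, a value sitting exactly on a boundary falls into the lower band, so 1.5 is labelled "Highly Effective" in the output above, whereas the thresholds in the question place 1.5 in "Generally Effective". A minimal sketch, assuming left-closed intervals are what is wanted, passes right = FALSE:

```r
Control_Score <- c(-1, 0, 1.4, 1.5, 2.499, 2.5, 3.499, 3.5, 4)
score_labels <- c("Highly Effective", "Generally Effective",
                  "Marginally Effective", "Ineffective")

# right = FALSE makes each interval closed on the left: [0, 1.5), [1.5, 2.5), ...
res <- cut(
  Control_Score,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = score_labels,
  right = FALSE,
  include.lowest = TRUE
)
as.character(res)
# -1 is still NA; 1.5, 2.5 and 3.5 now fall into the higher band
```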
To replace the values in df:
df <- read.table(
header = TRUE,
text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1 2.0 0.0 2.0 4 1.50 2.5
Unit2 3.0 1.4 3.2 1 1.50 3.0
Unit3 2.5 2.4 2.8 3 2.75 2.5
Unit4 3.0 3.2 1.6 4 1.00 3.0
Unit5 2.0 2.8 2.0 2 3.00 2.5
')
df[-1] <- lapply(
df[-1],
cut,
breaks = c(0, 1.5, 2.5, 3.5, Inf),
labels = c(
"Highly Effective",
"Generally Effective",
"Marginally Effective",
"Ineffective"
),
include.lowest = TRUE
)
df[-1] simply means "everything except the first column of df". Use whatever subset your actual data requires.
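cut returns factors, while the question asks for string values. A small sketch, assuming plain character columns are preferred, wraps the call in as.character and then counts NAs per column to surface any out-of-range scores. It keeps the default right = TRUE used in this answer.

```r
df <- read.table(
  header = TRUE,
  text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1 2.0 0.0 2.0 4 1.50 2.5
Unit2 3.0 1.4 3.2 1 1.50 3.0
Unit3 2.5 2.4 2.8 3 2.75 2.5
Unit4 3.0 3.2 1.6 4 1.00 3.0
Unit5 2.0 2.8 2.0 2 3.00 2.5
')

score_breaks <- c(0, 1.5, 2.5, 3.5, Inf)
score_labels <- c("Highly Effective", "Generally Effective",
                  "Marginally Effective", "Ineffective")

# Convert each score column to a character vector of labels
df[-1] <- lapply(df[-1], function(x) {
  as.character(cut(x, breaks = score_breaks, labels = score_labels,
                   include.lowest = TRUE))
})

# Number of scores per column that fell outside the breaks (became NA)
na_counts <- colSums(is.na(df[-1]))
na_counts
```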
Answer 3 (score: 0)
You can also use the findInterval() function, which is also in base R:
myintervals <- c(-Inf, 0, 1.5, 2.5, 3.5, Inf)
mylabels <- c(NA, "Highly Effective", "Generally Effective",
              "Marginally Effective", "Ineffective")
df[,-1] <- mylabels[sapply(df[,-1], function(x) findInterval(x,myintervals))]
df
##      BU                AUDIT                 CORC                  GOV
## 1 Unit1  Generally Effective     Highly Effective  Generally Effective
## 2 Unit2 Marginally Effective     Highly Effective Marginally Effective
## 3 Unit3 Marginally Effective  Generally Effective Marginally Effective
## 4 Unit4 Marginally Effective Marginally Effective  Generally Effective
## 5 Unit5  Generally Effective Marginally Effective  Generally Effective
##                    PPS                 TMSC                TRAIN
## 1          Ineffective  Generally Effective Marginally Effective
## 2     Highly Effective  Generally Effective Marginally Effective
## 3 Marginally Effective Marginally Effective Marginally Effective
## 4          Ineffective     Highly Effective Marginally Effective
## 5  Generally Effective Marginally Effective Marginally Effective
Data:
df <- structure(list(BU = structure(1:5, .Label = c("Unit1", "Unit2",
"Unit3", "Unit4", "Unit5"), class = "factor"), AUDIT = c(2, 3,
2.5, 3, 2), CORC = c(0, 1.4, 2.4, 3.2, 2.8), GOV = c(2, 3.2,
2.8, 1.6, 2), PPS = c(4, 1, 3, 4, 2), TMSC = c(1.5, 1.5, 2.75,
1, 3), TRAIN = c(2.5, 3, 2.5, 3, 2.5)), .Names = c("BU", "AUDIT",
"CORC", "GOV", "PPS", "TMSC", "TRAIN"), row.names = c(NA, 5L), class = "data.frame")
Note: I prefer an approach that raises an error when the data fall outside the defined boundaries, so that you find out about such values, rather than classifying them as NA by default (which is what findInterval() does here).
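One way to get an explicit error with findInterval() is to drop the -Inf boundary and check for index 0, which findInterval() returns for values below the first break. This is only a sketch: label_scores is a hypothetical helper name, and treating negative scores as invalid is an assumption.

```r
myintervals <- c(0, 1.5, 2.5, 3.5, Inf)
mylabels <- c("Highly Effective", "Generally Effective",
              "Marginally Effective", "Ineffective")

# label_scores() is a hypothetical helper, not part of base R
label_scores <- function(x) {
  # Intervals are closed on the left: [0, 1.5), [1.5, 2.5), ...
  idx <- findInterval(x, myintervals)
  # Index 0 means the value lies below the first boundary
  if (any(idx == 0, na.rm = TRUE)) {
    stop("score below the lowest boundary")
  }
  mylabels[idx]
}

label_scores(c(0, 1.4, 1.5, 2.5, 3.5, 4))
```

It can be applied to the whole data frame with df[-1] <- lapply(df[-1], label_scores).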