Convert a data frame of numbers to string equivalents based on reference ranges

Time: 2017-08-23 20:29:23

Tags: r string dplyr

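I have a data frame of numeric scores for groups (y) measured across different factors (x). It looks like the table below.

BU      AUDIT CORC   GOV    PPS   TMSC   TRAIN
Unit1   2.00  0.00   2.00   4.00  1.50   2.50
Unit2   3.00  1.40   3.20   1.00  1.50   3.00
Unit3   2.50  2.40   2.80   3.00  2.75   2.50
Unit4   3.00  3.20   1.60   4.00  1.00   3.00
Unit5   2.00  2.80   2.00   2.00  3.00   2.50

The table is created like this:

library(dplyr)
library(reshape2)  # dcast() is assumed to come from reshape2 here

df %>%
  group_by(BU, CC) %>%  # BU = 'unit', CC = 'Control_Category'
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(BU ~ CC, value.var = "avg") %>% print()

Each numeric score maps to a string value, as shown in the reference "table" below.

Control_Score >  3.499 ~ "Ineffective",
Control_Score >  2.499  & Control_Score <= 3.499 ~ "Marginally Effective",
Control_Score >= 1.500  & Control_Score <= 2.499 ~ "Generally Effective",
Control_Score >  0.000  & Control_Score <= 1.499 ~ "Highly Effective"

Can someone help me find a way to keep the layout of the data table but convert each numeric score to its string value from the reference table?

I tried several apply functions to compare the values. I also tried mutate with case_when, without success.

Ideally, the final table would keep the same layout, with each numeric score replaced by its label.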

4 Answers:

Answer 0 (score: 1)

You can use nested ifelse statements to change the numeric values into strings.

BU <- c("Unit1", "Unit2", "Unit3", "Unit4", "Unit5")
Audit <- c(2,3,2.5,3,2)
CORC <- c(0,1.4,2.4,3.2,2.8)
GOV <- c(2,3.2,2.8,1.6,2)

df <- data.frame(BU, Audit, CORC, GOV)
df$BU <- as.character(df$BU)
df$Audit <- as.numeric(as.character(df$Audit))
df$CORC <- as.numeric(as.character(df$CORC))
df$GOV <- as.numeric(as.character(df$GOV))

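# Nested ifelse: test the highest band first, then work down.
# The third test uses >= 1.5 so that a score of exactly 1.5 is
# "Generally Effective", as in the question's ranges.
df[,-1] <- ifelse(df[,-1]>3.499, "Ineffective",
                  ifelse(df[,-1]>2.499 & df[,-1]<=3.499, "Marginally Effective",
                         ifelse(df[,-1]>=1.5 & df[,-1]<=2.499, "Generally Effective",
                                "Highly Effective")))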

> df
     BU                Audit                 CORC                  GOV
1 Unit1  Generally Effective     Highly Effective  Generally Effective
2 Unit2 Marginally Effective     Highly Effective Marginally Effective
3 Unit3 Marginally Effective  Generally Effective Marginally Effective
4 Unit4 Marginally Effective Marginally Effective  Generally Effective
5 Unit5  Generally Effective Marginally Effective  Generally Effective
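
The nested ifelse can also be wrapped in a small helper, which makes the cut-offs easy to check at the boundary values. This is only a minimal sketch, assuming the ranges given in the question; `label_score` is a hypothetical name and not part of the answer.

```r
# Hypothetical helper: map numeric scores to labels with nested ifelse().
# Each inner test only runs for values that failed the test above it,
# so the upper bound of every band is implied.
label_score <- function(x) {
  ifelse(x > 3.499, "Ineffective",
         ifelse(x > 2.499, "Marginally Effective",
                ifelse(x >= 1.5, "Generally Effective",
                       "Highly Effective")))
}

# Boundary values taken from the ranges in the question.
boundary <- c(0, 1.499, 1.5, 2.499, 2.5, 3.499, 3.5)
result <- label_score(boundary)
print(data.frame(score = boundary, label = result))
```

Applied per column, for example with `df[-1] <- lapply(df[-1], label_score)`, the helper should give the same labels while keeping the data frame structure.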

If you want to collapse the whole table into comma-separated strings, one per row, you can add the following code:

df[2:(NROW(df)+1),] <- df[1:NROW(df),]
df[1,] <- colnames(df)

new_df <- apply( df, 1 , paste , collapse = "," )

Output:

> new_df
                                                                    1 
                                                  "BU,Audit,CORC,GOV" 
                                                                    2 
     "Unit1,Generally Effective,Highly Effective,Generally Effective" 
                                                                    3 
   "Unit2,Marginally Effective,Highly Effective,Marginally Effective" 
                                                                    4 
"Unit3,Marginally Effective,Generally Effective,Marginally Effective" 
                                                                    5 
"Unit4,Marginally Effective,Marginally Effective,Generally Effective" 
                                                                    6 
 "Unit5,Generally Effective,Marginally Effective,Generally Effective" 

Answer 1 (score: 1)

You can do this with case_when from dplyr.

df1 <- read.table(header = TRUE,
  text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
  Unit1   2.0  0.0 2.0   4 1.50   2.5
  Unit2   3.0  1.4 3.2   1 1.50   3.0
  Unit3   2.5  2.4 2.8   3 2.75   2.5
  Unit4   3.0  3.2 1.6   4 1.00   3.0
  Unit5   2.0  2.8 2.0   2 3.00   2.5
  ')

I put the case_when inside a function.

score_label <- function(score){
  lbl <- case_when(
    score < 1.5 ~ "Highly Effective",
    score >= 1.5 & score < 2.5 ~ "Generally Effective",
    score >= 2.5 & score < 3.5 ~ "Marginally Effective",
    score >= 3.5 ~ "Ineffective"
  )
  return(lbl)
} 

Then apply the function to the data frame. (Edited following AOSmith's comment to use mutate_at from dplyr instead of the apply function; it is easier to read and follow.)

library(dplyr)  # provides %>%, mutate_at and case_when

df_out <- df1 %>% 
    mutate_at(c("AUDIT", "CORC", "GOV", "PPS", "TMSC", "TRAIN"), score_label)

df_out[,1:4]


   BU                AUDIT                 CORC                  GOV
Unit1  Generally Effective     Highly Effective  Generally Effective
Unit2 Marginally Effective     Highly Effective Marginally Effective
Unit3 Marginally Effective  Generally Effective Marginally Effective
Unit4 Marginally Effective Marginally Effective  Generally Effective
Unit5  Generally Effective Marginally Effective  Generally Effective
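
In dplyr 1.0.0 and later, mutate_at is superseded by across(). The sketch below shows the same step under that assumption; it repeats the data and the score_label function from this answer so that it runs on its own, and selects every column except BU instead of naming each score column.

```r
library(dplyr)

df1 <- read.table(header = TRUE, text = "BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1   2.0  0.0 2.0   4 1.50   2.5
Unit2   3.0  1.4 3.2   1 1.50   3.0
Unit3   2.5  2.4 2.8   3 2.75   2.5
Unit4   3.0  3.2 1.6   4 1.00   3.0
Unit5   2.0  2.8 2.0   2 3.00   2.5
")

# Same labelling function as in the answer above.
score_label <- function(score) {
  case_when(
    score < 1.5 ~ "Highly Effective",
    score >= 1.5 & score < 2.5 ~ "Generally Effective",
    score >= 2.5 & score < 3.5 ~ "Marginally Effective",
    score >= 3.5 ~ "Ineffective"
  )
}

# across(-BU, ...) applies score_label to every column except BU.
df_out <- df1 %>%
  mutate(across(-BU, score_label))

print(df_out[, 1:4])
```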

Answer 2 (score: 0)

cut is a good function for dividing numbers into intervals and giving each interval a descriptive name.

Control_Score <- c(-1, 0, 1.4, 1.5, 2.499, 2.5, 3.499, 3.5, 4)

cut(
  Control_Score,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = c(
    "Highly Effective",
    "Generally Effective",
    "Marginally Effective",
    "Ineffective"
  ),
  include.lowest = TRUE
)
# [1] <NA>                 Highly Effective     Highly Effective    
# [4] Highly Effective     Generally Effective  Generally Effective 
# [7] Marginally Effective Marginally Effective Ineffective         
# 4 Levels: Highly Effective Generally Effective ... Ineffective

As you can see with -1, any value outside the specified intervals is assigned NA, so invalid data is less likely to go unnoticed.
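
One detail worth checking: with the default right = TRUE the intervals are closed on the right, so 1.5 lands in "Highly Effective" and 2.5 in "Generally Effective", whereas the ranges in the question put those values one band higher. The sketch below is a variant with right = FALSE, which closes the intervals on the left instead. It is an adjustment to this answer, not part of it.

```r
Control_Score <- c(-1, 0, 1.4, 1.5, 2.499, 2.5, 3.499, 3.5, 4)

# right = FALSE gives [0, 1.5), [1.5, 2.5), [2.5, 3.5), [3.5, Inf];
# include.lowest = TRUE then closes the last interval at its upper end.
labels_left_closed <- cut(
  Control_Score,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = c(
    "Highly Effective",
    "Generally Effective",
    "Marginally Effective",
    "Ineffective"
  ),
  right = FALSE,
  include.lowest = TRUE
)

print(data.frame(score = Control_Score, label = labels_left_closed))
```

Values below 0 still become NA, as in the original call.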

To replace the values in df:

df <- read.table(
  header = TRUE,
  text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1   2.0  0.0 2.0   4 1.50   2.5
Unit2   3.0  1.4 3.2   1 1.50   3.0
Unit3   2.5  2.4 2.8   3 2.75   2.5
Unit4   3.0  3.2 1.6   4 1.00   3.0
Unit5   2.0  2.8 2.0   2 3.00   2.5
  ')

df[-1] <- lapply(
  df[-1],
  cut,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = c(
    "Highly Effective",
    "Generally Effective",
    "Marginally Effective",
    "Ineffective"
  ),
  include.lowest = TRUE
)

df[-1] just means "everything except the first column of df". Use whatever subset your actual data requires.
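
For instance, instead of relying on column position, the score columns can be picked by type. This is a small sketch, assuming every numeric column in the data holds a score; `num_cols` is an illustrative name.

```r
df <- read.table(header = TRUE, text = "BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1   2.0  0.0 2.0   4 1.50   2.5
Unit2   3.0  1.4 3.2   1 1.50   3.0
Unit3   2.5  2.4 2.8   3 2.75   2.5
Unit4   3.0  3.2 1.6   4 1.00   3.0
Unit5   2.0  2.8 2.0   2 3.00   2.5
")

# Logical vector marking the numeric columns (BU is not numeric).
num_cols <- vapply(df, is.numeric, logical(1))

# Same cut() settings as above, applied only to the numeric columns.
df[num_cols] <- lapply(
  df[num_cols],
  cut,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = c(
    "Highly Effective",
    "Generally Effective",
    "Marginally Effective",
    "Ineffective"
  ),
  include.lowest = TRUE
)

str(df)
```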

Answer 3 (score: 0)

You can also use the base R findInterval() function:

Data:

df <- structure(list(
  BU = structure(1:5, .Label = c("Unit1", "Unit2", "Unit3", "Unit4", "Unit5"),
                 class = "factor"),
  AUDIT = c(2, 3, 2.5, 3, 2),
  CORC = c(0, 1.4, 2.4, 3.2, 2.8),
  GOV = c(2, 3.2, 2.8, 1.6, 2),
  PPS = c(4, 1, 3, 4, 2),
  TMSC = c(1.5, 1.5, 2.75, 1, 3),
  TRAIN = c(2.5, 3, 2.5, 3, 2.5)),
  .Names = c("BU", "AUDIT", "CORC", "GOV", "PPS", "TMSC", "TRAIN"),
  row.names = c(NA, 5L), class = "data.frame")

# Values below 0 fall in the first interval and get the NA label.
myintervals <- c(-Inf, 0, 1.5, 2.5, 3.5, Inf)
mylabels    <- c(NA, "Highly Effective", "Generally Effective", 
                 "Marginally Effective", "Ineffective")

df[,-1] <- mylabels[sapply(df[,-1], function(x) findInterval(x,myintervals))]

df
##      BU                AUDIT                 CORC                  GOV
## 1 Unit1  Generally Effective     Highly Effective  Generally Effective
## 2 Unit2 Marginally Effective     Highly Effective Marginally Effective
## 3 Unit3 Marginally Effective  Generally Effective Marginally Effective
## 4 Unit4 Marginally Effective Marginally Effective  Generally Effective
## 5 Unit5  Generally Effective Marginally Effective  Generally Effective
##                    PPS                 TMSC                TRAIN
## 1          Ineffective  Generally Effective Marginally Effective
## 2     Highly Effective  Generally Effective Marginally Effective
## 3 Marginally Effective Marginally Effective Marginally Effective
## 4          Ineffective     Highly Effective Marginally Effective
## 5  Generally Effective Marginally Effective Marginally Effective

Note: I prefer an approach that raises an error when the data fall outside the defined boundaries, so that you find out about such values instead of having them classified by default (which is what findInterval() does).
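
One way to get an error for out-of-range data while still using findInterval() is to check the returned indices before looking up the labels. The sketch below is an illustration only: `label_strict` is a hypothetical helper, not something from this answer, and it treats NA, negative and infinite scores as invalid.

```r
# Hypothetical strict variant: stop on scores that fall outside the bands.
label_strict <- function(x,
                         breaks = c(0, 1.5, 2.5, 3.5, Inf),
                         labels = c("Highly Effective", "Generally Effective",
                                    "Marginally Effective", "Ineffective")) {
  idx <- findInterval(x, breaks)  # 0 means "below the first break"
  bad <- is.na(idx) | idx < 1 | idx > length(labels)
  if (any(bad)) {
    stop("scores outside the defined ranges: ",
         paste(x[bad], collapse = ", "))
  }
  labels[idx]
}

in_range <- label_strict(c(0, 1.5, 2.5, 3.5))
print(in_range)

# A negative score triggers the error; tryCatch() captures its message.
msg <- tryCatch(label_strict(c(2, -1)), error = function(e) conditionMessage(e))
print(msg)
```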