I am trying to separate the `Grade` column into multiple columns, one per subject, each holding the corresponding grade.
grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",sep=";")
# Rename the column names
names(grade)<-c("Student_ID","Name","Venue","Grade")
head(grade)
# Separate `Grade` into one column per subject, each holding the corresponding grade
library(tidyverse)
df<- grade %>% separate(Grade,paste("V",1:7,sep="_"),sep=":")
head(df)
# This still does not separate `subject` and `grade` into their own columns
# Here is what I want it to look like
new_df<-df[c(1:5),c(1:4)]
new_df<-data.frame(new_df, V2=c(1:5)) # the same for V2, 4, 5, 6, 7 to separate subject and grade
new_df
I have been trying with dplyr and stringr, but cannot produce the expected result.
Answer 0 (score: 4)
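Here is an attempt using the tidyverse packages. After converting everything to character (i.e. grade[] <- lapply(grade, as.character)), we create a custom function that returns the sorted subject: grade pairs for each Student_ID. We then use unnest to make the data long and separate to split it into two columns, Subject and Grade. Finally, we spread to get one column per subject.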
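library(tidyverse)
# Convert every column to character so that strsplit() can be applied
grade[] <- lapply(grade, as.character)
# This function could definitely be more elegant, or even avoided,
# but this is as far as my regex knowledge allows me to go
mysplit <- function(x){
  # Split on ": " and on whitespace, giving alternating subjects and grades
  y <- strsplit(x, ':\\s+|\\s+')[[1]]
  # Re-join each subject with its grade
  z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
  # Sort the pairs alphabetically by subject
  return(z[order(sub(':.*', '', z))])
}
grade %>%
  mutate(Grade = lapply(Grade, mysplit)) %>%
  unnest(Grade) %>%
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>%
  spread(Subject, Grade)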
which gives:
... Biology Chemitry English Geography History Literature Math Physics
... 1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
... 2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
... 3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
... 4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
... 5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
... 6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40
. .
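The `<NA>` entries in the printout (rather than a bare `NA`) indicate that the grade columns are still character strings after spreading. A minimal sketch of converting them to numbers, using base R only. The small `wide` data frame below is a hypothetical stand-in for the spread result, and its `Student_ID` values are made up for illustration:

```r
# Hypothetical stand-in for the spread result; grades are still character
wide <- data.frame(
  Student_ID = c(1, 2),              # made-up IDs, not taken from the CSV
  Biology    = c("6.00", "5.80"),
  Physics    = c("6.80", NA),
  stringsAsFactors = FALSE
)

# Every column except the ID column is treated as a subject column
subject_cols <- setdiff(names(wide), "Student_ID")

# Convert each subject column from character to numeric; NA stays NA
wide[subject_cols] <- lapply(wide[subject_cols], as.numeric)

# Numeric columns can now be summarised, ignoring missing grades
subject_means <- colMeans(wide[subject_cols], na.rm = TRUE)
```

On the real data, `subject_cols` would have to exclude all the non-grade columns (ID, name, venue), not just one.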
To better understand the function, it helps to break it down.
Suppose x is as follows:
x
#[1] "Math: 4.25 Literature: 7.50 Physics: 6.80 Chemitry: 6.00 Biology: 6.00"
Splitting on every space, or on a colon followed by a space, gives the following vector:
y <- strsplit(x, ':\\s+|\\s+')[[1]]
y
#[1] "Math" "4.25" "Literature" "7.50" "Physics" "6.80" "Chemitry" "6.00" "Biology" "6.00"
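Next, paste every first element of each pair (i.e. the subjects, y[c(TRUE, FALSE)]) together with every second element (i.e. the grades, y[c(FALSE, TRUE)]), using ': ' as the separator:
z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])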
z
#[1] "Math: 4.25" "Literature: 7.50" "Physics: 6.80" "Chemitry: 6.00" "Biology: 6.00"
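The alternating selection relies on R recycling a short logical index along the whole vector. A small sketch of that behaviour on a shortened, illustrative vector, with `seq()` shown as an equivalent and more explicit alternative:

```r
# Shortened example vector: subjects and grades alternate
y <- c("Math", "4.25", "Literature", "7.50")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE: keeps positions 1, 3, ...
subjects <- y[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled to FALSE, TRUE, FALSE, TRUE: keeps positions 2, 4, ...
grades <- y[c(FALSE, TRUE)]

# The same selection written with explicit positions
subjects_by_pos <- y[seq(1, length(y), by = 2)]
grades_by_pos   <- y[seq(2, length(y), by = 2)]
```

Writing `TRUE`/`FALSE` in full is the safer habit, since `T` and `F` are ordinary variables that can be reassigned.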
Finally, it returns the vector sorted by subject name (the word extracted with sub(':.*', '', z)):
z[order(sub(':.*', '', z))]
#[1] "Biology: 6.00" "Chemitry: 6.00" "Literature: 7.50" "Math: 4.25" "Physics: 6.80"
As @rosscova pointed out, the strings do not need to be sorted, which simplifies things a lot (no function is needed after all), i.e.
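grade %>%
  # Split on whitespace that follows a digit; the lookbehind keeps that digit in the grade
  mutate(Grade = strsplit(Grade, '(?<=[0-9])\\s+', perl = TRUE)) %>%
  unnest(Grade) %>%
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>%
  spread(Subject, Grade)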
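It is worth checking a split pattern on a single string before running the whole pipeline. A pattern of the form '[0-9]\\s+' matches the digit as well as the whitespace, so the last digit of every grade except the final one is consumed by the split. A lookbehind only requires the digit to be there without matching it. A short base R sketch on an illustrative string:

```r
# Illustrative input in the same "Subject: grade" layout
x <- "Math: 4.25 Literature: 7.50 Physics: 6.80"

# The digit is part of the match, so it is removed together with the space
consuming <- strsplit(x, "[0-9]\\s+")[[1]]

# The lookbehind (?<=[0-9]) asserts a preceding digit but leaves it in place;
# lookbehind requires perl = TRUE in base R
lookbehind <- strsplit(x, "(?<=[0-9])\\s+", perl = TRUE)[[1]]

print(consuming)
print(lookbehind)
```

Here `consuming` holds "Math: 4.2", "Literature: 7.5" and "Physics: 6.80", while `lookbehind` holds the three pairs intact.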
Answer 1 (score: 1)
In my solution I use functions from the tidyverse and rebus packages. The rebus package builds regular expressions piece by piece from human-readable code.
library(tidyverse)
library(rebus)
grade <- read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",
                  sep = ";", stringsAsFactors = FALSE)
grade_new <- grade %>%
  mutate(DIEM_THI2 = str_replace_all(DIEM_THI, pattern = ":" %R% one_or_more(SPC), "-")) %>%
  separate_rows(DIEM_THI2, sep = one_or_more(SPC)) %>%
  separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
  spread(SUBJECT, GRADE)
The resulting data frame looks like this:
head(grade_new[,5:12])
# Biology Chemitry English Geography History Literature Math Physics
# 1 6.00 6.00 <NA> <NA> <NA> 7.50 4.25 6.80
# 2 5.80 6.00 <NA> <NA> <NA> 6.00 5.75 <NA>
# 3 <NA> <NA> <NA> 8.00 4.50 7.75 2.25 <NA>
# 4 <NA> <NA> <NA> 7.25 7.50 7.75 3.25 <NA>
# 5 <NA> <NA> <NA> 7.75 4.50 8.25 1.75 <NA>
# 6 <NA> 6.60 6.78 <NA> <NA> 7.00 8.75 8.40
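The code can be understood as follows:
First, "Math: 4.25 Literature: 7.50" becomes "Math-4.25 Literature-7.50". This is done with the str_replace_all function. Let's call the new variable DIEM_THI2.
Next, the separate_rows function splits the space-delimited column DIEM_THI2 into separate rows, i.e. "Math-4.25" and "Literature-7.50" end up on two different rows.
Then the DIEM_THI2 column is split by separate into two columns, SUBJECT and GRADE, where the former contains values such as "Math" and "Literature" and the latter values such as "4.25" and "7.50".
Finally, spread turns SUBJECT into one column per subject, filled with the corresponding GRADE values.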
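The same steps can be tried without the CSV or the rebus package. The sketch below is a variation rather than the code above: it assumes plain regex strings (`":\\s+"` and `"\\s+"`) are close enough to the rebus patterns for this data, it uses `pivot_wider()`, which current tidyr recommends in place of `spread()`, and the `toy` data frame with its `ID` column is made up for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up data in the same layout as the DIEM_THI column
toy <- data.frame(
  ID = c(1, 2),
  DIEM_THI = c("Math: 4.25 Literature: 7.50", "Math: 5.75 Chemitry: 6.00"),
  stringsAsFactors = FALSE
)

toy_wide <- toy %>%
  # "Math: 4.25" becomes "Math-4.25", so spaces only separate the pairs
  mutate(DIEM_THI2 = str_replace_all(DIEM_THI, ":\\s+", "-")) %>%
  # One row per "Subject-grade" pair
  separate_rows(DIEM_THI2, sep = "\\s+") %>%
  # Split each pair into its subject and its grade
  separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
  # One column per subject; subjects a student did not take become NA
  pivot_wider(names_from = SUBJECT, values_from = GRADE)
```

The grades in `toy_wide` are still character strings, so a conversion with `as.numeric` would be needed before doing arithmetic on them.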