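How to separate one column into multiple columns (complex column)

Asked: 2017-07-08 10:03:32

Tags: r string data-manipulation

I am trying to separate the `Grade` column into multiple columns, one per subject, each holding the corresponding grade.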

    grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",sep=";")

    # Rename the columns

    names(grade)<-c("Student_ID","Name","Venue","Grade")

    head(grade)

    # Separate `Grade` into `subject` variables and corresponding `Grade` columns
    library(tidyverse)


    df<- grade %>% separate(Grade,paste("V",1:7,sep="_"),sep=":")

    head(df)

    # It still is not separating `subject ` and `grade` independently

    # Here is what I want it to look like

    new_df<-df[c(1:5),c(1:4)]

    new_df<-data.frame(new_df, V2=c(1:5)) # the same for V2,4,5,6,,7 to separate subject and grade

    new_df 

I have been trying to use dplyr and stringr, but could not produce the expected result.

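2 Answers:

Answer 0: (score: 4)

Here is an attempt using the tidyverse packages. After converting everything to character (i.e. grade[] <- lapply(grade, as.character)), we create a custom function that returns the sorted subject: grade pairs for each student. We then use unnest to make the data long, and separate to split each pair into two columns, Subject and Grade. Finally, we spread to get one column per subject.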

library(tidyverse)

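# This function could definitely be more elegant, or even avoided,
# but this is as far as my regex knowledge allows me to go

mysplit <- function(x){
  # Split on a colon plus whitespace, or on whitespace alone
  y <- strsplit(x, ':\\s+|\\s+')[[1]]
  # Pair each subject (odd positions) with its grade (even positions)
  z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
  # Return the pairs sorted by subject name
  return(z[order(sub(':.*', '', z))])
}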

grade %>% 
  mutate(Grade = lapply(Grade, mysplit)) %>% 
  unnest(Grade) %>% 
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>% 
  spread(Subject, Grade)

This gives:

...     Biology Chemitry English Geography History Literature Math Physics
...   1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
...   2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
...   3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
...   4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
...   5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
...   6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40
.
.

To better understand the function, it helps to break it down. Suppose x is the following:

x
#[1] "Math:   4.25   Literature:   7.50   Physics:   6.80   Chemitry:   6.00   Biology:   6.00"

Split on a colon followed by whitespace, or on whitespace alone, to get the following vector:

y <- strsplit(x, ':\\s+|\\s+')[[1]]
y
 #[1] "Math"       "4.25"       "Literature" "7.50"       "Physics"    "6.80"       "Chemitry"   "6.00"       "Biology"    "6.00"

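Next, paste every first element (i.e. the subjects, y[c(TRUE, FALSE)]) together with every second element (i.e. the grades, y[c(FALSE, TRUE)]), using ': ' as the separator:

z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])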
z
#[1] "Math: 4.25"       "Literature: 7.50" "Physics: 6.80"    "Chemitry: 6.00"   "Biology: 6.00"   
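The alternating selection relies on R recycling a short logical index along the whole vector. A minimal sketch on a shortened, made-up vector (the names `subjects` and `grades` are introduced here only for illustration):

```r
# Shortened, made-up version of the split vector y
y <- c("Math", "4.25", "Literature", "7.50")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE: keeps odd positions
subjects <- y[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled to FALSE, TRUE, FALSE, TRUE: keeps even positions
grades <- y[c(FALSE, TRUE)]

# paste0 is vectorised, so each subject is joined with its own grade
z <- paste0(subjects, ": ", grades)
print(z)
```

This only works when subjects and grades strictly alternate, which is the case for the string shown above.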

Finally, it returns the vector sorted by subject name (the part extracted with sub(':.*', '', z)):

z[order(sub(':.*', '', z))]
#[1] "Biology: 6.00"    "Chemitry: 6.00"   "Literature: 7.50" "Math: 4.25"       "Physics: 6.80"

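As @rosscova pointed out, the strings do not need to be sorted, which simplifies things a lot (no function is needed after all), i.e.

grade %>% 
  # split on whitespace that follows a digit, so no digit of the grade is consumed
  mutate(Grade = strsplit(Grade, '(?<=[0-9])\\s+', perl = TRUE)) %>% 
  unnest(Grade) %>% 
  # the colon may be followed by several spaces
  separate(Grade, into = c('Subject', 'Grade'), sep = ':\\s+') %>% 
  spread(Subject, Grade)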
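In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). Below is a minimal sketch of the same idea in that style, run on a small made-up data frame; `toy` and its two rows are assumptions for illustration, not the asker's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking the layout of the Grade column
toy <- data.frame(
  Student_ID = c(1, 2),
  Grade = c("Math:   4.25   Literature:   7.50",
            "Math:   5.75   Physics:   6.80"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  # one list element per student, holding its "Subject:   grade" pairs
  mutate(Grade = strsplit(Grade, "(?<=[0-9])\\s+", perl = TRUE)) %>%
  # one row per pair
  unnest(Grade) %>%
  # split each pair at the colon; convert = TRUE turns the grades into numbers
  separate(Grade, into = c("Subject", "Grade"), sep = ":\\s+", convert = TRUE) %>%
  # one column per subject, NA where a student has no grade for it
  pivot_wider(names_from = Subject, values_from = Grade)

print(wide)
```

Subjects a student did not take come out as NA, just as in the spread() version.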

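Answer 1: (score: 1)

In my solution I used functions from the tidyverse and rebus packages. The rebus package builds regular expressions piece by piece using human-readable code.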

 library(tidyverse)
 library(rebus)
 grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",
                 sep = ";", stringsAsFactors = FALSE)

 grade_new <- grade %>%
   mutate(DIEM_THI2 = str_replace_all(DIEM_THI, pattern = ":" %R% one_or_more(SPC), "-")) %>%
   separate_rows(DIEM_THI2, sep = one_or_more(SPC)) %>%
   separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
   spread(SUBJECT,GRADE)

The resulting data frame looks like this:

head(grade_new[,5:12])
#   Biology Chemitry English Geography History Literature Math Physics
# 1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
# 2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
# 3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
# 4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
# 5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
# 6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40

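The code can be understood as follows:

  1. Every colon-plus-spaces substring is replaced with a hyphen, i.e. "Math: 4.25 Literature: 7.50" becomes "Math-4.25 Literature-7.50". This is done with the str_replace_all function. Let's call the new variable DIEM_THI2.
  2. The separate_rows function splits the space-delimited column DIEM_THI2 into separate rows, i.e. "Math-4.25" and "Literature-7.50" end up in two different rows.
  3. The DIEM_THI2 column is split into two columns, SUBJECT and GRADE, where the former holds values such as "Math" and "Literature" and the latter holds values such as "4.25" and "7.50".
  4. The key-value pairs (SUBJECT-GRADE pairs) are spread across multiple columns.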
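The four steps can be sketched one at a time on a single made-up row. This sketch uses plain regular expressions (":\\s+" and "\\s+") in place of the rebus expressions, on the assumption that they match the same text; `toy`, `long` and `wide` are names introduced here for illustration.

```r
library(stringr)
library(tidyr)

# Hypothetical one-row data frame with the same column name as the real data
toy <- data.frame(
  ID = 1,
  DIEM_THI = "Math:   4.25   Literature:   7.50",
  stringsAsFactors = FALSE
)

# Step 1: replace each colon and the whitespace after it with a hyphen
toy$DIEM_THI2 <- str_replace_all(toy$DIEM_THI, ":\\s+", "-")

# Step 2: one row per whitespace-separated "SUBJECT-GRADE" pair
long <- separate_rows(toy, DIEM_THI2, sep = "\\s+")

# Step 3: split each pair into a SUBJECT column and a GRADE column
long <- separate(long, DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-")

# Step 4: spread the pairs into one column per subject
wide <- spread(long, SUBJECT, GRADE)

print(wide)
```

Printing `toy$DIEM_THI2` and `long` after each step is a quick way to see what every stage contributes. Note that the grades remain character strings here; convert them with as.numeric if arithmetic is needed.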