Tidy and cast data that has headers in rows

Time: 2017-10-03 18:01:34

Tags: r tidyr reshape2

demodf <- data.frame(
  name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
  Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"))

   name     Field    Values
1  Mike EDUCATION EDUCATION
2  Mike    Degree   Masters
3  Mike     Title   Student
4  Mike      WORK      WORK
5  Mike     Title  VP Sales
6   Joe EDUCATION EDUCATION
7   Joe    Degree  Bachelors
8   Joe     Title   Student
9   Joe      WORK      WORK
10  Joe     Title   Analyst

I'd like to use tidyr::spread or reshape2::dcast to get this into wide format, with Field becoming the column headers.

The code would look like dcast(demodf, name ~ Values) or demodf %>% spread(Field, Values). However, dcast coerces the values to numbers, and spread throws an error.
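
A quick way to see why both calls struggle is to check whether name and Field together identify a single row. This is a minimal base R sketch, rebuilding the sample data so that it runs on its own:

```r
# Sample data from above, rebuilt with character columns
demodf <- data.frame(
  name = rep(c("Mike", "Joe"), each = 5),
  Field = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), times = 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose (name, Field) pair already appeared earlier
dup <- duplicated(demodf[c("name", "Field")])

# The second "Title" row for each person (rows 5 and 10)
demodf[dup, ]
```

Any reshape keyed on name and Field has two candidate values for those cells, so it either has to aggregate them or give up.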

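The problem is that "Title" is duplicated. As you can see, because of a quirk in the data, EDUCATION and WORK act as "false" headers inside the data. Is it possible to tag each Field entry with its all-caps header so that dcast works (i.e. Title_EDUCATION and Title_WORK)? Ideally the transformation would be applied to the whole of Field, so that "EDUCATION" and "WORK" disappear altogether and we are left with Degree_EDUCATION, Title_EDUCATION, and so on.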

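Note that the real data has many more headers, so it would be best to identify the "false headers" either as the entries that are all caps or as the entries where Field == Values.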
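
To make the two candidate rules concrete, here is a small base R sketch that applies both to one person's rows. The last part uses a hypothetical field name ("GPA") that does not appear in the data, only to show where the rules could disagree:

```r
field  <- c("EDUCATION", "Degree", "Title", "WORK", "Title")
values <- c("EDUCATION", "Masters", "Student", "WORK", "VP Sales")

# Rule 1: a false header is a Field entry written entirely in capitals
by_caps <- field == toupper(field)

# Rule 2: a false header is a row where Field and Values are the same
by_match <- field == values

# On the sample data both rules flag the same rows
identical(by_caps, by_match)

# Hypothetical case: a genuine field whose name happens to be all caps
field2  <- c(field, "GPA")
values2 <- c(values, "3.9")
caps2  <- field2 == toupper(field2)   # flags "GPA" as a header
match2 <- field2 == values2           # does not flag "GPA"
```

The Field == Values rule does not depend on capitalisation, so it is the safer of the two if real field names can be upper case.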

Desired output:

output <- data.frame(
 Name=c("Mike", "Joe"),
 Degree_EDUCATION =c("Masters", "Bachelors"),
 Title_EDUCATION = c("Student", "Student"),
 Title_WORK= c("VP Sales", "Analyst"))

  Name Degree_EDUCATION Title_EDUCATION Title_WORK
1 Mike          Masters         Student   VP Sales
2  Joe        Bachelors         Student    Analyst

2 answers:

Answer 0: (score: 3)

The key is to turn the repeated category rows into a new column; after that the data is easy to work with.

First, add stringsAsFactors=FALSE so that Field and Values can be compared. (From R 4.0.0 onward this is already the default.)

demodf <- data.frame(
  name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
  Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"),
  stringsAsFactors=FALSE)

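Now use tidyr and dplyr to add two columns, one flagging whether the row is a category and one holding the name of that category, fill in the missing values, and then drop the extra rows and columns.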

library(tidyr)
library(dplyr)
d2 <- demodf %>% mutate(IsCategory=Field==Values,
                        Category=ifelse(IsCategory, Field, NA)) %>%
  fill(Category) %>% subset(!IsCategory, select=-IsCategory)
d2
##    name  Field    Values  Category
## 2  Mike Degree   Masters EDUCATION
## 3  Mike  Title   Student EDUCATION
## 5  Mike  Title  VP Sales      WORK
## 7   Joe Degree Bachelors EDUCATION
## 8   Joe  Title   Student EDUCATION
## 10  Joe  Title   Analyst      WORK

Then dcast will work as you hoped!

library(reshape2)    
dcast(d2, name ~ Field+Category, value.var="Values")
##   name Degree_EDUCATION Title_EDUCATION Title_WORK
## 1  Joe        Bachelors         Student    Analyst
## 2 Mike          Masters         Student   VP Sales
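
In tidyr 1.0.0 and later, spread has been superseded by pivot_wider, which can build the combined column names itself. The following is a sketch of the same idea under that assumption, not part of the answer as originally posted. It rebuilds the sample data so that it runs on its own.

```r
library(dplyr)
library(tidyr)

demodf <- data.frame(
  name = rep(c("Mike", "Joe"), each = 5),
  Field = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), times = 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE
)

wide <- demodf %>%
  # Header rows carry their own name; every other row starts as NA
  mutate(Category = ifelse(Field == Values, Field, NA_character_)) %>%
  # Carry the most recent header down to the rows beneath it
  fill(Category) %>%
  # Drop the header rows themselves
  filter(Field != Values) %>%
  # Column names become Field and Category joined by "_"
  pivot_wider(id_cols = name,
              names_from = c(Field, Category),
              values_from = Values)

wide
```

Note that fill works downward in row order, so this relies on each person's block starting with a header row, as it does in the sample data.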

Answer 1: (score: 0)

Here is an attempt with data.table. It requires the data to have been created with stringsAsFactors = FALSE.

library(data.table)
# get groupings by titles (all caps)
setDT(demodf)[, head := cumsum(Field == toupper(Field))]
# merge titles onto full dataset and paste title to Field
demodf[demodf[Field == toupper(Field), .(Field, head)], on="head",
       Field := paste(Field, i.Field, sep="_"), by=.EACHI]
# now reshape wide
dcast(demodf[Values != toupper(Values),], name~Field, value.var="Values")

This returns

   name Degree_EDUCATION Title_EDUCATION Title_WORK
1:  Joe        Bachelors         Student    Analyst
2: Mike          Masters         Student   VP Sales
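
The grouping step above rests on taking cumsum over a logical vector: each all-caps header row bumps the counter, so every row receives the index of the header it sits under. A base R sketch of just that step, using only the Field column from the sample data (the names is_header, head_id and section are chosen here for illustration):

```r
Field <- c("EDUCATION", "Degree", "Title", "WORK", "Title",
           "EDUCATION", "Degree", "Title", "WORK", "Title")

# TRUE on the all-caps header rows
is_header <- Field == toupper(Field)

# Running count of headers seen so far: 1 1 1 2 2 3 3 3 4 4
head_id <- cumsum(is_header)

# Look up the header text for each row via its group index
section <- Field[is_header][head_id]

# Tag the non-header fields with their section
tagged <- paste(Field, section, sep = "_")[!is_header]
tagged
```

As with the code above, this assumes that no genuine field name is written entirely in capitals.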

Data

demodf <-
structure(list(name = c("Mike", "Mike", "Mike", "Mike", "Mike", 
"Joe", "Joe", "Joe", "Joe", "Joe"), Field = c("EDUCATION", "Degree", 
"Title", "WORK", "Title", "EDUCATION", "Degree", "Title", "WORK", 
"Title"), Values = c("EDUCATION", "Masters", "Student", "WORK", 
"VP Sales", "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"
)), .Names = c("name", "Field", "Values"), row.names = c(NA, 
-10L), class = "data.frame")