Casting with data.table in R based on unique ID numbers that can appear in multiple rows

Date: 2019-04-30 23:03:36

Tags: r data.table

Third edit

I tried the tidyverse solution. It works on my example data, but not on my real data.

For example:

Example2 <- Example %>% # tidyverse option
  gather(key, value, -(2:6), -Degree_Level) %>%
  unite(key, key, Degree_Level) %>%
  spread(key, value)
dput(Example2)

gives me this result:

attributes are not identical across measure variables;
they will be dropped
structure(list(Student_ID = c(9010307, 200810309, 200920773, 
201020497, 201030353, 201040559), Doc_Type = c("SSN", "SSN", 
"SSN", "SSN", "SSN", "DL"), Doc_Num = c(506786590, 546764202, 
546849791, 548017430, 547490424, 301147353), Last_Name = c("Sanchez", 
"Rivera", "Anderson", "Yang", "del Torre", "Smith"), First_Names = c("Jose", 
"Ana Maria", "Rachel Anne", "Amanda", "Amanda", "Daniel Erick"
), Campus_A = c(NA, NA, NA, "C", NA, "A"), Campus_B = c("A", 
"A", "B", "C", "A", "A"), Degree_Field_A = c(NA, NA, NA, "Civil Engineering", 
NA, "Education"), Degree_Field_B = c("Education", "Nursing", 
"Psychology", "Civil Engineering", "Psychology", "Education"), 
    Degree_Name_A = c(NA, NA, NA, "BS in Civil Engineering", 
    NA, "BA in Education"), Degree_Name_B = c("MA in Education", 
    "MS in Nursing", "MS in Psychology", "MS in Civil Engineering", 
    "MS in Psychology", "MA in Education"), Department_A = c(NA, 
    NA, NA, "Engineering", NA, "Education"), Department_B = c("Education", 
    "Health Sciences", "Health Sciences", "Engineering", "Health Sciences", 
    "Education"), Diploma_Number_A = c(NA, NA, NA, "7959", NA, 
    "7870"), Diploma_Number_B = c("7876", "7872", "7873", "12689", 
    "7875", "8155"), Exp_A = c(NA, NA, NA, "72", NA, "4"), Exp_B = c("3", 
    "2", "1", "5598", "7", "275"), Gender_A = c(NA, NA, NA, "F", 
    NA, "M"), Gender_B = c("M", "F", "F", "F", "F", "M"), Graduation_Date_A = c(NA, 
    NA, NA, "1440979200", NA, "1438560000"), Graduation_Date_B = c("1438560000", 
    "1438560000", "1438646400", "1512086400", "1438646400", "1445472000"
    ), Project_Type_A = c(NA, NA, NA, "Project", NA, "Project"
    ), Project_Type_B = c("Internship", "Thesis", "Internship", 
    "Thesis", "Thesis", "Internship")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

Or, if I change the gather call to gather(key, value, -(1:6), -Degree_Level) %>%, I get:

attributes are not identical across measure variables;
they will be dropped
structure(list(Exp = c(1, 2, 3, 4, 7, 72, 275, 5598), Student_ID = c(200920773, 
200810309, 9010307, 201040559, 201030353, 201020497, 201040559, 
201020497), Doc_Type = c("SSN", "SSN", "SSN", "DL", "SSN", "SSN", 
"DL", "SSN"), Doc_Num = c(546849791, 546764202, 506786590, 301147353, 
547490424, 548017430, 301147353, 548017430), Last_Name = c("Anderson", 
"Rivera", "Sanchez", "Smith", "del Torre", "Yang", "Smith", "Yang"
), First_Names = c("Rachel Anne", "Ana Maria", "Jose", "Daniel Erick", 
"Amanda", "Amanda", "Daniel Erick", "Amanda"), Campus_A = c(NA, 
NA, NA, "A", NA, "C", NA, NA), Campus_B = c("B", "A", "A", NA, 
"A", NA, "A", "C"), Degree_Field_A = c(NA, NA, NA, "Education", 
NA, "Civil Engineering", NA, NA), Degree_Field_B = c("Psychology", 
"Nursing", "Education", NA, "Psychology", NA, "Education", "Civil Engineering"
), Degree_Name_A = c(NA, NA, NA, "BA in Education", NA, "BS in Civil Engineering", 
NA, NA), Degree_Name_B = c("MS in Psychology", "MS in Nursing", 
"MA in Education", NA, "MS in Psychology", NA, "MA in Education", 
"MS in Civil Engineering"), Department_A = c(NA, NA, NA, "Education", 
NA, "Engineering", NA, NA), Department_B = c("Health Sciences", 
"Health Sciences", "Education", NA, "Health Sciences", NA, "Education", 
"Engineering"), Diploma_Number_A = c(NA, NA, NA, "7870", NA, 
"7959", NA, NA), Diploma_Number_B = c("7873", "7872", "7876", 
NA, "7875", NA, "8155", "12689"), Gender_A = c(NA, NA, NA, "M", 
NA, "F", NA, NA), Gender_B = c("F", "F", "M", NA, "F", NA, "M", 
"F"), Graduation_Date_A = c(NA, NA, NA, "1438560000", NA, "1440979200", 
NA, NA), Graduation_Date_B = c("1438646400", "1438560000", "1438560000", 
NA, "1438646400", NA, "1445472000", "1512086400"), Project_Type_A = c(NA, 
NA, NA, "Project", NA, "Project", NA, NA), Project_Type_B = c("Internship", 
"Thesis", "Internship", NA, "Thesis", NA, "Internship", "Thesis"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

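The problem is that with my real data I can run the (1:6) version without any problems, but it does not give me the output I want, because it does not combine rows based on Student_ID. If I try it with (2:6), however, I get this error: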

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 612 rows: * 113609, 113610 * 109095, 115383 * 110472, 110895 * 114397, 115479 * 113072, 114744 * 114414, 115480 * 108967, 111112 * 110532, 112950 * 110537, 112969 * 110492, 110493 * 110781, 110782 * 114412, 114413 * 115456, 115457 * 116933, 116934 * 117238, 117239 * 117050, 117134 * 115959, 115960 * 114521, 114522 * 13061, 13062 * 8547, 14835 * 9924, 10347 * 13849, 14931 * 12524, 14196 * 13866, 14932 * 8419, 10564 * 9984, 12402 * 9989, 12421 * 9944, 9945 * 10233, 10234 * 13864, 13865 * 14908, 14909 * 16385, 16386 * 16690, 16691 * 16502, 16586 * 15411, 15412 * 13973, 13974 * 38198, 38199 * 33684, 39972 * 35061, 35484 * 38986, 40068 * 37661, 39333 * 39003, 40069 * 33556, 35701 * 35121, 37539 * 35126, 37558 * 35081, 35082 * 35370, 35371 * 39001, 39002 * 40045, 40046 * 41522, 41523 * 41827, 41828 * 41639, 41723 * 40548, 40549 * 39110, 39111 * 138746, 138747 * 134232, 140520 * 135609, 136032 *
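The error says that, after gathering, several input rows share the same combination of key columns, so spread() cannot decide which value belongs in a cell. One plausible cause, and this is only an assumption since the real data is not shown, is that some students have more than one record at the same Degree_Level. A minimal base R sketch for locating such rows on hypothetical toy data:

```r
# Toy data (hypothetical): student 1 has two level-B degrees, so the
# (Student_ID, Degree_Level) pair does not identify a single row.
toy <- data.frame(
  Student_ID   = c(1, 1, 1, 2),
  Degree_Level = c("A", "B", "B", "A"),
  Degree_Name  = c("BA", "MA", "MS", "BS"),
  stringsAsFactors = FALSE
)

# Columns that are supposed to identify one output cell per measure
key_cols <- c("Student_ID", "Degree_Level")

# Flag every row whose key combination occurs more than once
# (scanning from both ends marks all members of a duplicate group)
is_dup <- duplicated(toy[key_cols]) | duplicated(toy[key_cols], fromLast = TRUE)
dup_rows <- toy[is_dup, ]
print(dup_rows)
```

If rows like these turn up in the real data, they would need to be deduplicated or given an extra distinguishing key before spreading.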

Second edit

Thanks for the help so far. I want to update the question with a more useful data example.

> dput(Example)
structure(list(Exp = c(4, 3, 2, 7, 1, 72, 275, 5598), Student_ID = c(201040559, 
9010307, 200810309, 201030353, 200920773, 201020497, 201040559, 
201020497), Doc_Type = c("DL", "SSN", "SSN", "SSN", "SSN", "SSN", 
"DL", "SSN"), Doc_Num = c(301147353, 506786590, 546764202, 547490424, 
546849791, 548017430, 301147353, 548017430), Last_Name = c("Smith", 
"Sanchez", "Rivera", "del Torre", "Anderson", "Yang", "Smith", 
"Yang"), First_Names = c("Daniel Erick", "Jose", "Ana Maria", 
"Amanda", "Rachel Anne", "Amanda", "Daniel Erick", "Amanda"), 
    Gender = c("M", "M", "F", "F", "F", "F", "M", "F"), Degree_Field = c("Education", 
    "Education", "Nursing", "Psychology", "Psychology", "Civil Engineering", 
    "Education", "Civil Engineering"), Department = c("Education", 
    "Education", "Health Sciences", "Health Sciences", "Health Sciences", 
    "Engineering", "Education", "Engineering"), Campus = c("A", 
    "A", "A", "A", "B", "C", "A", "C"), Degree_Name = c("BA in Education", 
    "MA in Education", "MS in Nursing", "MS in Psychology", "MS in Psychology", 
    "BS in Civil Engineering", "MA in Education", "MS in Civil Engineering"
    ), Degree_Level = c("A", "B", "B", "B", "B", "A", "B", "B"
    ), Graduation_Date = structure(c(1438560000, 1438560000, 
    1438560000, 1438646400, 1438646400, 1440979200, 1445472000, 
    1512086400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Project_Type = c("Project", "Internship", "Thesis", "Thesis", 
    "Internship", "Project", "Internship", "Thesis"), Diploma_Number = c("7870", 
    "7876", "7872", "7875", "7873", "7959", "8155", "12689")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

In RStudio it looks like this: [image] When I try the first solution provided, the code looks like this:

Example
Example2 <- Example %>%
  gather(key, value, -(2:7), -Degree_Level) %>%
  unite(key, key, Degree_Level) %>%
  spread(key, value)
dput(Example2)

which gives me this in the console:

attributes are not identical across measure variables;
they will be dropped
structure(list(Student_ID = c(9010307, 200810309, 200920773, 
201020497, 201030353, 201040559), Doc_Type = c("SSN", "SSN", 
"SSN", "SSN", "SSN", "DL"), Doc_Num = c(506786590, 546764202, 
546849791, 548017430, 547490424, 301147353), Last_Name = c("Sanchez", 
"Rivera", "Anderson", "Yang", "del Torre", "Smith"), First_Names = c("Jose", 
"Ana Maria", "Rachel Anne", "Amanda", "Amanda", "Daniel Erick"
), Gender = c("M", "F", "F", "F", "F", "M"), Campus_A = c(NA, 
NA, NA, "C", NA, "A"), Campus_B = c("A", "A", "B", "C", "A", 
"A"), Degree_Field_A = c(NA, NA, NA, "Civil Engineering", NA, 
"Education"), Degree_Field_B = c("Education", "Nursing", "Psychology", 
"Civil Engineering", "Psychology", "Education"), Degree_Name_A = c(NA, 
NA, NA, "BS in Civil Engineering", NA, "BA in Education"), Degree_Name_B = c("MA in Education", 
"MS in Nursing", "MS in Psychology", "MS in Civil Engineering", 
"MS in Psychology", "MA in Education"), Department_A = c(NA, 
NA, NA, "Engineering", NA, "Education"), Department_B = c("Education", 
"Health Sciences", "Health Sciences", "Engineering", "Health Sciences", 
"Education"), Diploma_Number_A = c(NA, NA, NA, "7959", NA, "7870"
), Diploma_Number_B = c("7876", "7872", "7873", "12689", "7875", 
"8155"), Exp_A = c(NA, NA, NA, "72", NA, "4"), Exp_B = c("3", 
"2", "1", "5598", "7", "275"), Graduation_Date_A = c(NA, NA, 
NA, "1440979200", NA, "1438560000"), Graduation_Date_B = c("1438560000", 
"1438560000", "1438646400", "1512086400", "1438646400", "1445472000"
), Project_Type_A = c(NA, NA, NA, "Project", NA, "Project"), 
    Project_Type_B = c("Internship", "Thesis", "Internship", 
    "Thesis", "Thesis", "Internship")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

The problem is that with my real data sample I get this error in the console (I clicked Show Traceback):

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 324 rows: * 54956, 54957 * 50442, 56730 * 51819, 52242 * 55744, 56826 * 54419, 56091 * 55761, 56827 * 50314, 52459 * 51879, 54297 * 51884, 54316 * 51839, 51840 * 52128, 52129 * 55759, 55760 * 56803, 56804 * 58280, 58281 * 58585, 58586 * 58397, 58481 * 57306, 57307 * 55868, 55869 * 71714, 71715 * 67200, 73488 * 68577, 69000 * 72502, 73584 * 71177, 72849 * 72519, 73585 * 67072, 69217 * 68637, 71055 * 68642, 71074 * 68597, 68598 * 68886, 68887 * 72517, 72518 * 73561, 73562 * 75038, 75039 * 75343, 75344 * 75155, 75239 * 74064, 74065 * 72626, 72627 * 4682, 4683 * 168, 6456 * 1545, 1968 * 5470, 6552 * 4145, 5817 * 5487, 6553 * 40, 2185 * 1605, 4023 * 1610, 4042 * 1565, 1566 * 1854, 1855 * 5485, 5486 * 6529, 6530 * 8006, 8007 * 8311, 8312 * 8123, 8207 * 7032, 7033 * 5594, 5595 * 21440, 21441 * 16926, 23214 * 18303, 18726 * 22228, 23310 * 20903, 22575 * 22245, 23311 * 16798, 18943 * 18363, 20781
12. stop(cnd)
11. abort(glue("Each row of output must be identified by a unique combination of keys.", "\nKeys are shared for {shared} rows:", "\n{rows}", "Do you need to create unique ID with tibble::rowid_to_column()?"))
10. spread.data.frame(., key, value)
9. spread(., key, value)
8. function_list[[k]](value)
7. withVisible(function_list[[k]](value))
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. Example %>% gather(key, value, -(2:8), -Degree_Type) %>% unite(key, key, Degree_Type) %>% spread(key, value)

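I am working with an Excel file that contains graduation information for students at a university over the past 5 years. I want to process this data to get an output with the student ID numbers of all students who have earned a bachelor's degree but not a master's degree.

The Excel file looks roughly like this: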

Student_ID | Last_Name | First_Names | Gender | Degree_Field | Degree_Level | Project_Type | Graduation_Date | Degree_Name
20120001   | Smith     | Jane Ellen  | F      | Education    | A            | Exam         | 30/06/2016      | B.A. in Secondary Education
20130002   | Yang      | Henry       | M      | Nursing      | A            | Internship   | 29/06/2018      | B.S. in Nursing
20120001   | Smith     | Jane Ellen  | F      | Education    | B            | Thesis       | 20/11/2018      | M.A. in Secondary Education

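A bachelor's degree has Degree_Level A, a master's degree has Degree_Level B, and a doctorate has Degree_Level C. I want to process this data in two different ways. First, I want a consolidated table with only one row per Student_ID, while keeping Degree_Field, Project_Type, Graduation_Date and Degree_Name for each Degree_Level, like this: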

Student_ID | Last_Name | First_Names | Gender | Degree_Field_A | Project_Type_A | Graduation_Date_A | Degree_Name_A               | Degree_Field_B | Project_Type_B | Graduation_Date_B | Degree_Name_B
20120001   | Smith     | Jane Ellen  | F      | Education      | Exam           | 30/06/2016        | B.A. in Secondary Education | Education      | Thesis         | 20/11/2018        | M.A. in Secondary Education
20130002   | Yang      | Henry       | M      | Nursing        | Internship     | 29/06/2018        | B.S. in Nursing             | NA             | NA             | NA                | NA

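Note that Jane Ellen Smith has a complete record, because she earned a bachelor's degree first and then a master's degree. Henry Yang, however, has NA in all of the B fields, because he has not completed a master's degree yet. Once the data is in this format, it should be easy to get two outputs: one that uses Degree_Field_A to count, per field, the students who hold both a bachelor's and a master's degree, and another that shows how many students hold a bachelor's degree but no master's degree (in other words, the B fields are NA).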
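To make the two target outputs concrete, here is a small base R sketch on a toy wide table. The values, the third student, and the variable names (wide, both_by_field, bachelor_only_ids) are hypothetical and only illustrate the layout described above.

```r
# Toy wide table (hypothetical values) in the one-row-per-student layout
wide <- data.frame(
  Student_ID     = c(20120001, 20130002, 20140003),
  Degree_Field_A = c("Education", "Nursing", "Education"),
  Degree_Field_B = c("Education", NA, NA),
  stringsAsFactors = FALSE
)

# Students holding both a bachelor's and a master's degree,
# counted per bachelor's field
both <- wide[!is.na(wide$Degree_Field_A) & !is.na(wide$Degree_Field_B), ]
both_by_field <- table(both$Degree_Field_A)

# Students holding a bachelor's degree but no master's degree
# (the B fields are NA)
bachelor_only <- wide[!is.na(wide$Degree_Field_A) & is.na(wide$Degree_Field_B), ]
bachelor_only_ids <- bachelor_only$Student_ID

print(both_by_field)
print(bachelor_only_ids)
```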

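Edit

I found an answer to a similar question, and although it is close, it does not give me the result I need: https://stackoverflow.com/a/44958373/1709198. For a student like Jane Ellen Smith, it would give Degree_Field_1, Project_Type_1, etc. as expected, and then Degree_Field_2, Project_Type_2, etc. My problem is that if a student earned a bachelor's degree from ti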

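2 answers:

Answer 0: (score: 1)

One tidyverse option is to first gather the data into long format, unite the key with Degree_Level into a single column, and then spread it back into wide format.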

library(tidyverse)

df %>%
  gather(key, value, -(1:4), -Degree_Level) %>%
  unite(key, key, Degree_Level) %>%
  spread(key, value)  

#  Student_ID Last_Name First_Names Gender Degree_Field_A Degree_Field_B
#1   20120001     Smith  Jane Ellen      F      Education      Education
#2   20130002      Yang       Henry      M        Nursing           <NA>

#               Degree_Name_A                Degree_Name_B  Graduation_Date_A
# B.A. in Secondary Education  M.A. in Secondary Education         30/06/2016
#             B.S. in Nursing                         <NA>         29/06/2018

# Graduation_Date_B Project_Type_A Project_Type_B
#        20/11/2018           Exam         Thesis
#              <NA>     Internship           <NA>

Data

df <- structure(list(Student_ID = c("20120001", "20130002", "20120001"
), Last_Name = c("Smith", "Yang", "Smith"), First_Names = c("Jane Ellen", 
"Henry", "Jane Ellen"), Gender = c("F", "M", "F"), Degree_Field = 
c("Education", "Nursing", "Education"), Degree_Level = c("A", "A", "B"), 
Project_Type = c("Exam", "Internship", "Thesis"), 
Graduation_Date = c("30/06/2016", "29/06/2018","20/11/2018"), 
Degree_Name = c("B.A. in Secondary Education", "B.S. in Nursing", 
"M.A. in Secondary Education")), row.names = c(NA, -3L), class = "data.frame")
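As a side note, gather() and spread() have been superseded in tidyr (1.0.0 and later) by pivot_longer() and pivot_wider(). The following is a hedged sketch of the same reshape with pivot_wider(), not part of the original answer. It rebuilds the answer's toy df so that it runs on its own, and it assumes each Student_ID has at most one row per Degree_Level.

```r
library(tidyr)

# Same toy data as the answer's df, rebuilt here so the block is self-contained
df <- data.frame(
  Student_ID      = c("20120001", "20130002", "20120001"),
  Last_Name       = c("Smith", "Yang", "Smith"),
  First_Names     = c("Jane Ellen", "Henry", "Jane Ellen"),
  Gender          = c("F", "M", "F"),
  Degree_Field    = c("Education", "Nursing", "Education"),
  Degree_Level    = c("A", "A", "B"),
  Project_Type    = c("Exam", "Internship", "Thesis"),
  Graduation_Date = c("30/06/2016", "29/06/2018", "20/11/2018"),
  Degree_Name     = c("B.A. in Secondary Education", "B.S. in Nursing",
                      "M.A. in Secondary Education"),
  stringsAsFactors = FALSE
)

# One row per student; each measure column is spread into <name>_<Degree_Level>
wide <- pivot_wider(
  df,
  id_cols     = c(Student_ID, Last_Name, First_Names, Gender),
  names_from  = Degree_Level,
  values_from = c(Degree_Field, Project_Type, Graduation_Date, Degree_Name)
)
print(wide)
```

Because each values_from column is spread on its own, columns of different classes are not forced into a single shared value column the way they are with gather().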

Answer 1: (score: 1)

I think you can get the output you want simply by chaining melt and dcast from data.table.

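library(data.table)

# Columns that identify a student, and the columns to spread per Degree_Level
IDvars<-c("Student_ID","Last_Name","First_Names","Gender")
MeasureVars<-c("Degree_Field","Project_Type","Graduation_Date","Degree_Name")

# Melt to long format, then cast back to wide (one column per Degree_Level + variable)
DT[,melt(.SD, measure.vars = MeasureVars )][,dcast(.SD,paste(paste0(IDvars,collapse = "+"),"~","Degree_Level","+","variable"))]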

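A few notes on the code above:

  1. I assume your data.table is called DT; if not, change the code accordingly. I only specified four measure variables for melt.

  2. Running the melt code gives you a data.table containing all of the IDvars, Degree_Level, a column named "variable" by default (which holds the names of the measure variables), and a column named "value" by default (which holds the values of the measure variables).

  3. About the dcast formula: I just use paste to avoid typing out all of the IDvars separated by +. paste0 with the collapse argument is useful here. Basically, you need the IDvars added together on the LHS, and Degree_Level + variable on the RHS.

  4. .SD is a special symbol in data.table. It lets you chain the intermediate result from melt without saving it.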
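The notes above can be followed step by step on a small table. This is a minimal sketch, assuming a toy DT with only two ID columns and two measure variables (hypothetical, cut down from the question's data). The intermediate results are saved instead of chained through .SD, and the formula string is converted with as.formula() to make that step explicit.

```r
library(data.table)

# Toy data (hypothetical subset of the question's columns)
DT <- data.table(
  Student_ID   = c("20120001", "20130002", "20120001"),
  Last_Name    = c("Smith", "Yang", "Smith"),
  Degree_Level = c("A", "A", "B"),
  Degree_Field = c("Education", "Nursing", "Education"),
  Project_Type = c("Exam", "Internship", "Thesis")
)

IDvars      <- c("Student_ID", "Last_Name")
MeasureVars <- c("Degree_Field", "Project_Type")

# Step 1: melt to long format; this adds the default "variable" and "value"
# columns, and every non-measure column is kept as an id column
long <- melt(DT, measure.vars = MeasureVars)

# Step 2: build the formula "Student_ID+Last_Name ~ Degree_Level + variable"
f <- as.formula(paste(paste0(IDvars, collapse = "+"), "~", "Degree_Level", "+", "variable"))

# Step 3: cast back to wide format, one row per student
wide <- dcast(long, f, value.var = "value")
print(wide)
```

With this formula the new columns are named with the level first, for example A_Degree_Field and B_Project_Type. Swapping the right-hand side to variable + Degree_Level would put the measure name first instead.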

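Hope that helps. Let me know if my explanation is clear. Good luck!

Edit: I just saw that you updated the question with a more realistic dataset. I reran the code on it and it works, but you will get a warning because the classes of the measure variables are not consistent. They will be automatically coerced to character, so it should not affect things much. This may be the cause of the problem you ran into with the dplyr solution.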
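If the coercion to character is unwanted, one possible alternative (an addition here, not part of the original answer) is to skip melt and pass several columns to dcast through value.var, which data.table supports. Each measure column is then cast separately and keeps its own class. The toy DT below is hypothetical.

```r
library(data.table)

# Toy data (hypothetical) with measure columns of different classes
DT <- data.table(
  Student_ID      = c(1, 2, 1),
  Degree_Level    = c("A", "A", "B"),
  Degree_Field    = c("Education", "Nursing", "Education"),
  Graduation_Date = as.POSIXct(c("2016-06-30", "2018-06-29", "2018-11-20"), tz = "UTC")
)

# Cast both measure columns at once; result columns are named
# <measure>_<Degree_Level>, e.g. Degree_Field_A, Graduation_Date_B
wide <- dcast(DT, Student_ID ~ Degree_Level,
              value.var = c("Degree_Field", "Graduation_Date"))
print(wide)
```

This also yields the Degree_Field_A style of column names that the question asked for, again assuming at most one row per student and degree level.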