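Hi, I have a dataset with a unique ID variable in column A, followed by the renal scans performed for each patient. It is a CSV file, and I would like to reshape it into long format using R if possible. Each participant can have 1 to 17 renal scans.
Some IDs are also listed as "No" to indicate that no scan was received. I would like to reshape it into something like this
I am aware of previous questions that are organised by year; I have scans from participants that occur multiple times, with dates in yyyy-mm-dd format.
Please see the data below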
structure(list(id = c(1010001, 1010002, 1010004, 1010005, 1010006,
1010007), `GFR Scans?` = c("Yes", "Yes", "Yes", "Yes", "Yes",
"No"), `1. Date of renal scan:` = structure(c(1133913600, 1196812800,
1237334400, 1124150400, 1192060800, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `1. Type of renal scan:` = c("DTPA",
"DTPA", "DTPA", "DTPA", "DTPA", NA), `1. GFR mL/1.73 sq.m` = c(18,
13, 68, 117, 46, NA), `1. Pre/Post tx?` = c("Pre", "Pre", "Post",
"Post", "Pre", NA), `2. Date of renal scan:` = structure(c(1146528000,
1214524800, NA, 1151366400, 1245974400, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `2. Type of renal scan:` = c("DTPA",
"DTPA", NA, "DTPA", "DTPA", NA), `2. GFR mL/1.73 sq.m` = c(86,
110, NA, 148, 123, NA), `2. Pre/Post tx?` = c("Post", "Post",
NA, "Post", "Post", NA), `3. Date of renal scan:` = structure(c(NA,
1219104000, NA, 1184025600, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `3. Type of renal scan:` = c(NA, "DTPA", NA,
"DTPA", NA, NA), `3. GFR mL/1.73 sq.m` = c(NA, 92, NA, 166, NA,
NA), `3. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `4. Date of renal scan:` = structure(c(NA,
1242691200, NA, 1213660800, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `4. Type of renal scan:` = c(NA, "DTPA", NA,
"DTPA", NA, NA), `4. GFR mL/1.73 sq.m` = c(NA, 36, NA, 171, NA,
NA), `4. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `5. Date of renal scan:` = structure(c(NA,
NA, NA, 1288656000, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`5. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA),
`5. GFR mL/1.73 sq.m` = c(NA, NA, NA, 105, NA, NA), `5. Pre/Post tx?` = c(NA,
NA, NA, "Post", NA, NA), `6. Date of renal scan:` = structure(c(NA,
NA, NA, 1323129600, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `6. Type of renal scan:` = c(NA, NA, NA,
"DTPA", NA, NA), `6. GFR mL/1.73 sq.m` = c(NA, NA, NA, 103,
NA, NA), `6. Pre/Post tx?` = c(NA, NA, NA, "Post", NA, NA
), `7. Date of renal scan:` = structure(c(NA, NA, NA, 1355184000,
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`7. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA),
`7. GFR mL/1.73 sq.m` = c(NA, NA, NA, 98, NA, NA), `7. Pre/Post tx?` = c(NA,
NA, NA, "Post", NA, NA), `8. Date of renal scan:` = c(NA,
NA, NA, NA, NA, NA), `8. Type of renal scan:` = c(NA, NA,
NA, NA, NA, NA), `8. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA,
NA, NA), `8. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `9. Date of renal scan:` = c(NA,
NA, NA, NA, NA, NA), `9. Type of renal scan:` = c(NA, NA,
NA, NA, NA, NA), `9. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA,
NA, NA), `9. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `10. Date of renal scan:` = c(NA,
NA, NA, NA, NA, NA), `10. Type of renal scan:` = c(NA, NA,
NA, NA, NA, NA), `10. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA,
NA, NA), `10. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA),
`11. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `11. Type of renal scan:` = c(NA,
NA, NA, NA, NA, NA), `11. GFR mL/1.73 sq.m` = c(NA, NA, NA,
NA, NA, NA), `11. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
), `12. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA),
`12. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `12. GFR mL/1.73 sq.m` = c(NA,
NA, NA, NA, NA, NA), `12. Pre/Post tx?` = c(NA, NA, NA, NA,
NA, NA), `13. Date of renal scan:` = c(NA, NA, NA, NA, NA,
NA), `13. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA
), `13. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, NA), `13. Pre/Post tx?` = c(NA,
NA, NA, NA, NA, NA), `14. Date of renal scan:` = c(NA, NA,
NA, NA, NA, NA), `14. Type of renal scan:` = c(NA, NA, NA,
NA, NA, NA), `14. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA,
NA), `14. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `15. Date of renal scan:` = c(NA,
NA, NA, NA, NA, NA), `15. Type of renal scan:` = c(NA, NA,
NA, NA, NA, NA), `15. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA,
NA, NA), `15. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA),
`16. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `16. Type of renal scan:` = c(NA,
NA, NA, NA, NA, NA), `16. GFR mL/1.73 sq.m` = c(NA, NA, NA,
NA, NA, NA), `16. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
), `17. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA),
`17. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `17. GFR mL/1.73 sq.m` = c(NA,
NA, NA, NA, NA, NA), `17. Pre/Post tx?` = c(NA, NA, NA, NA,
NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
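The first image shows the original wide format, and the second image shows what I want. Because multiple columns are involved, none of the other answers helped me.
For example, id 1010001 had two scans, and I need these scans listed one below the other rather than side by side (see image 2).
Thank you very much for your help.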
Answer 0 (score: 3)
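This question has been asked several times before, e.g. Reshaping multiple sets of measurement columns (wide format) into single columns (long format). One possible approach is to use the melt() function from data.table, which can reshape multiple value columns simultaneously.
However, there is an additional difficulty here which, IMHO, justifies an answer of its own: the column names sometimes contain surplus whitespace, which has to be removed beforehand so that the columns follow a consistent naming pattern.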
names(df1)
[1] "id" "GFR Scans?" "1. Date of renal scan:" "1. Type of renal scan:"
[5] "1. GFR mL/1.73 sq.m" "1. Pre/Post tx?" "2. Date of renal scan:" "2. Type of renal scan:"
[9] "2. GFR mL/1.73 sq.m" "2. Pre/Post tx?" "3. Date of renal scan:" "3. Type of renal scan:"
[13] "3. GFR mL/1.73 sq.m" "3. Pre/Post tx?" "4. Date of renal scan:" "4. Type of renal scan:"
[17] "4. GFR mL/1.73 sq.m" "4. Pre/Post tx?" "5. Date of renal scan:" "5. Type of renal scan:"
[21] "5. GFR mL/1.73 sq.m" "5. Pre/Post tx?" "6. Date of renal scan:" "6. Type of renal scan:"
[25] "6. GFR mL/1.73 sq.m" "6. Pre/Post tx?" "7. Date of renal scan:" "7. Type of renal scan:"
[29] "7. GFR mL/1.73 sq.m" "7. Pre/Post tx?" "8. Date of renal scan:" "8. Type of renal scan:"
[33] "8. GFR mL/1.73 sq.m" "8. Pre/Post tx?" "9. Date of renal scan:" "9. Type of renal scan:"
[37] "9. GFR mL/1.73 sq.m" "9. Pre/Post tx?" "10. Date of renal scan:" "10. Type of renal scan:"
[41] "10. GFR mL/1.73 sq.m" "10. Pre/Post tx?" "11. Date of renal scan:" "11. Type of renal scan:"
[45] "11. GFR mL/1.73 sq.m" "11. Pre/Post tx?" "12. Date of renal scan:" "12. Type of renal scan:"
[49] "12. GFR mL/1.73 sq.m" "12. Pre/Post tx?" "13. Date of renal scan:" "13. Type of renal scan:"
[53] "13. GFR mL/1.73 sq.m" "13. Pre/Post tx?" "14. Date of renal scan:" "14. Type of renal scan:"
[57] "14. GFR mL/1.73 sq.m" "14. Pre/Post tx?" "15. Date of renal scan:" "15. Type of renal scan:"
[61] "15. GFR mL/1.73 sq.m" "15. Pre/Post tx?" "16. Date of renal scan:" "16. Type of renal scan:"
[65] "16. GFR mL/1.73 sq.m" "16. Pre/Post tx?" "17. Date of renal scan:" "17. Type of renal scan:"
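As a small illustration of why the clean-up matters, here is a minimal base R sketch. The column names in it are hypothetical (they are made up with doubled spaces and are not taken from the real file); it only shows that collapsing runs of whitespace leaves a single naming stem per measurement once the scan number is stripped.

```r
# Hypothetical column names with inconsistent spacing (not from the real file)
raw_names <- c("id", "1.  Date of renal scan:", "2. Date  of renal scan:")

# Collapse every run of whitespace into a single space
clean_names <- gsub("\\s+", " ", raw_names)

# Strip the leading scan number; one stem should remain per measurement
stems <- unique(sub("^[0-9]+\\. ", "", clean_names[-1])) 
print(clean_names)
print(stems)
```

If more than one stem remained for the same measurement, pattern-based column selection would presumably split that measurement across several output columns.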
library(data.table)
library(magrittr)
# clean up column names: remove surplus whitespace
setDT(df1) %>%
  setnames(names(.) %>% stringr::str_replace_all("\\s+", " "))
# get name pattern for subsequent melt
cols <- names(df1)[3:6] %>%
  stringr::str_replace("1. ", "")
# reshape multiple columns from wide to long
long <- melt(df1, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
  # recreate lost POSIXct attribute
  , `Date of renal scan:` := lubridate::as_datetime(`Date of renal scan:`)][]
long
         id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
 1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
 2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
 3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
 4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
 5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
 6: 1010001        Yes        2          2006-05-02                DTPA               86         Post
 7: 1010002        Yes        2          2008-06-27                DTPA              110         Post
 8: 1010005        Yes        2          2006-06-27                DTPA              148         Post
 9: 1010006        Yes        2          2009-06-26                DTPA              123         Post
10: 1010002        Yes        3          2008-08-19                DTPA               92         Post
11: 1010005        Yes        3          2007-07-10                DTPA              166         Post
12: 1010002        Yes        4          2009-05-19                DTPA               36         Post
13: 1010005        Yes        4          2008-06-17                DTPA              171         Post
14: 1010005        Yes        5          2010-11-02                DTPA              105         Post
15: 1010005        Yes        6          2011-12-06                DTPA              103         Post
16: 1010005        Yes        7          2012-12-11                DTPA               98         Post
In the call to melt(), we can set the parameter na.rm = FALSE to keep all data.
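To make the effect of na.rm concrete, here is a minimal sketch on a toy table. The table, its values, and the names wide, value_cols, long_drop and long_keep are all made up for illustration; only melt() and patterns() come from data.table.

```r
library(data.table)

# Toy wide table with two scan slots per patient (hypothetical values)
wide <- data.table(
  id = c(1, 2),
  `1. GFR` = c(18, 13),
  `1. Type` = c("DTPA", "DTPA"),
  `2. GFR` = c(86, NA),
  `2. Type` = c("DTPA", NA)
)

# One regular expression per group of measurement columns
value_cols <- c("GFR", "Type")

# na.rm = TRUE drops the scan slot whose measurements are missing
long_drop <- melt(wide, measure.vars = patterns(value_cols),
                  value.name = value_cols, na.rm = TRUE)

# na.rm = FALSE keeps one row per id and scan slot, including empty ones
long_keep <- melt(wide, measure.vars = patterns(value_cols),
                  value.name = value_cols, na.rm = FALSE)

print(nrow(long_drop))
print(nrow(long_keep))
```

With na.rm = FALSE every patient contributes one row per scan slot, so in the full dataset a patient without scans would still appear, with NA in the measurement columns.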
Answer 1 (score: 2)
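Here is a solution that works; it is not the optimal one, but it is workable. The strategy is to go from wide to long to tidy.
When converting from the original wide format to the long format, all columns are converted to the lowest common format (character in this case), so the columns need to be converted back at the end.
To remove the rows with NAs, I use complete.cases, so your last ID, 1010007, is not in the final output. If that is a problem, the position of the NA clean-up step should be adjusted.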
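The coercion to character described above, and the conversion back, can be seen in a tiny base R sketch. It uses c() as a stand-in for stacking columns of different types (an assumption made for illustration, not the internals of the reshaping function); the number is the first scan date of id 1010001 from the sample data, stored as seconds since 1970-01-01.

```r
# A vector can hold only one type, so mixing a number with text
# converts everything to character
mixed <- c(1133913600, "DTPA")
print(class(mixed))

# Convert the stored seconds-since-epoch string back to a UTC date-time
scan_date <- as.POSIXct(as.numeric(mixed[1]), origin = "1970-01-01", tz = "UTC")
print(format(scan_date, "%Y-%m-%d"))
```

This is why the date and GFR columns have to be converted explicitly at the end of the pipeline.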
library(tidyr)
library(dplyr)
#convert from wide to long
new<-gather(df,key = "key", value = "value", -id, -`GFR Scans?`)
#clean up the key column
new$key<-sub("[0-9]+\\. ", "", new$key)
new$key<-gsub("[ ]+", " ", new$key)
# verify column headings (should only be 4)
unique(new$key)
#remove the rows with NA
new<-new[complete.cases(new),]
#now go from long to slightly wide
answer <- new %>%
  group_by(id, `GFR Scans?`, key) %>%
  mutate(testnum = row_number()) %>%
  spread(key, value)
#convert the columns back to the proper type
answer$`Date of renal scan:`<-as.POSIXct(as.numeric(answer$`Date of renal scan:`), origin="1970-01-01", tz="UTC")
answer$`GFR mL/1.73 sq.m`<-as.numeric(answer$`GFR mL/1.73 sq.m`)
answer
#        id `GFR Scans?` testnum `Date of renal scan:` `GFR mL/1.73 sq.m` `Pre/Post tx?` `Type of renal scan:`
#     <dbl> <chr>          <int> <dttm>                             <dbl> <chr>          <chr>
# 1 1010001 Yes                1 2005-12-07 00:00:00                   18 Pre            DTPA
# 2 1010001 Yes                2 2006-05-02 00:00:00                   86 Post           DTPA
# 3 1010002 Yes                1 2007-12-05 00:00:00                   13 Pre            DTPA
# 4 1010002 Yes                2 2008-06-27 00:00:00                  110 Post           DTPA
# 5 1010002 Yes                3 2008-08-19 00:00:00                   92 Post           DTPA
# 6 1010002 Yes                4 2009-05-19 00:00:00                   36 Post           DTPA
# 7 1010004 Yes                1 2009-03-18 00:00:00                   68 Post           DTPA
# 8 1010005 Yes                1 2005-08-16 00:00:00                  117 Post           DTPA