My dataset is organized as shown below (only a small part of it). For a given subject (here SUBJECT = 5), I have three test occasions, at D-1, D1 8h and D2 24h:
SUBJECT TIME TEST RESULT UNITS RANGES
591 5 D-1 Leukoyte count urine 1 /?L |-< 15|-
592 5 D-1 Erythrocyte count urine 0 /?L |-< 19|-
593 5 D-1 Glucose dipstick urine Normal None |+ from 50 mg/dL-|-
684 5 D1 8h Leukoyte count urine 0 /?L |-< 15|-
687 5 D1 8h Erythrocyte count urine 0 /?L |-< 19|-
683 5 D1 8h Glucose dipstick urine Normal None |+ from 50 mg/dL-|-
694 5 D2 24h Leukoyte count urine 1 /?L |-< 15|-
695 5 D2 24h Erythrocyte count urine 0 /?L |-< 19|-
696 5 D2 24h Glucose dipstick urine Normal None |+ from 50 mg/dL-|-
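I would like to reorganize these data into a table whose columns are laid out as follows:
TEST D-1 D1-8h D2-24h UNITS RANGES
so that the results are arranged by test.
I got confused with `table` and `aggregate`. I am sure it is not that complicated, but I have not found the right way to do it...
Could you give me some help?
Thanks
Here is the dput: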
> dput(dataset)
structure(list(SUBJECT = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), TIME = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("D-1",
"D1 8h", "D2 24h", "D4 72h"), class = "factor"), TEST = structure(c(35L,
24L, 28L, 35L, 24L, 28L, 35L, 24L, 28L), .Label = c("", "Alkaline phosphatase",
"APTT", "Basophils", "Basophils (%)", "Calcium", "CD19", "CD19 abs.",
"CD3", "CD3 abs.", "CD4/CD8 ratio", "CD4+", "CD4+ abs.", "CD56",
"CD56 absolute", "CD8+", "CD8+ abs.", "Chloride", "CK (creatine kinase)",
"Creatinine", "Direct bilirubin (conjug)", "Eosinophils", "Eosinophils (%)",
"Erythrocyte count urine", "Erythrocyte dipstick urine", "Gamma GT",
"Glucose", "Glucose dipstick urine", "GOT (AST)", "GPT (ALT)",
"Hematocrit", "Hemoglobin", "Ketone bodies urine", "Leukocyte esterase urine",
"Leukoyte count urine", "Lymphocytes", "Lymphocytes (%)", "Monocytes",
"Monocytes (%)", "Neutrophils", "Neutrophils (%)", "pH urine",
"Platelet count", "Potassium", "Protein urine", "PT INR", "Red blood cell count",
"Reticulocytes", "Reticulocytes %", "Serum Albumine", "Sodium",
"Total bilirubin", "Total cholesterol", "Total protein", "Triglycerides",
"Urea", "Urine glucose quantitative", "Urine protein quantitative",
"White blood cell count"), class = "factor"), RESULT = c("1",
"0", "Normal", "0", "0", "Normal", "1", "0", "Normal"), UNITS = c("/?L",
"/?L", "None", "/?L", "/?L", "None", "/?L", "/?L", "None"), RANGES = c("|-< 15|-",
"|-< 19|-", "|+ from 50 mg/dL-|-", "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-",
"|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-")), .Names = c("SUBJECT",
"TIME", "TEST", "RESULT", "UNITS", "RANGES"), row.names = c(591L,
592L, 593L, 684L, 687L, 683L, 694L, 695L, 696L), class = "data.frame")
Answer 0 (score: 1)
Is this it? If so, I think the question should be marked as a duplicate of "reshape data from long to wide in R".
library(tidyverse)
spread(dataset, key = TIME, value = UNITS)
# SUBJECT TEST RESULT RANGES D-1 D1 8h D2 24h
#1 5 Erythrocyte count urine 0 |-< 19|- /?L /?L /?L
#2 5 Glucose dipstick urine Normal |+ from 50 mg/dL-|- None None None
#3 5 Leukoyte count urine 0 |-< 15|- <NA> /?L <NA>
#4 5 Leukoyte count urine 1 |-< 15|- /?L <NA> /?L
Edit.
Peter_Evan corrected the call above in his comment: the values to spread are in RESULT, not UNITS. The correct solution is
spread(dataset, key = TIME, value = RESULT)
# SUBJECT TEST UNITS RANGES D-1 D1 8h D2 24h
#1 5 Erythrocyte count urine /?L |-< 19|- 0 0 0
#2 5 Glucose dipstick urine None |+ from 50 mg/dL-|- Normal Normal Normal
#3 5 Leukoyte count urine /?L |-< 15|- 1 0 1
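Why the first attempt returned four rows can be checked without any package. As far as I understand `spread()`, every column not named in `key` or `value` is used to identify the output rows, so leaving RESULT among the identifiers splits the leukocyte test in two. The sketch below rebuilds the nine rows of the question by hand (the `rep()` construction and the names `n_with_result` and `n_with_units` are mine, not from the question) and counts the distinct identifier combinations for each choice.

```r
# Hand-built copy of the nine rows shown in the question (an assumption:
# plain character columns instead of the factors in the dput).
dataset <- data.frame(
  SUBJECT = 5L,
  TIME    = rep(c("D-1", "D1 8h", "D2 24h"), each = 3),
  TEST    = rep(c("Leukoyte count urine", "Erythrocyte count urine",
                  "Glucose dipstick urine"), times = 3),
  RESULT  = c("1", "0", "Normal", "0", "0", "Normal", "1", "0", "Normal"),
  UNITS   = rep(c("/?L", "/?L", "None"), times = 3),
  RANGES  = rep(c("|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-"), times = 3),
  stringsAsFactors = FALSE
)

# Identifier columns left over when value = UNITS: RESULT varies over time
# for the leukocyte test, so it produces an extra row.
n_with_result <- nrow(unique(dataset[, c("SUBJECT", "TEST", "RESULT", "RANGES")]))

# Identifier columns left over when value = RESULT: constant per test.
n_with_units <- nrow(unique(dataset[, c("SUBJECT", "TEST", "UNITS", "RANGES")]))

print(c(n_with_result, n_with_units))  # 4 3
```

The identifier columns need to be constant within each test for the wide table to have one row per test, which is the case for UNITS and RANGES here.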
Or, if the OP wants the columns rearranged, do the following.
dataset %>%
  spread(key = TIME, value = RESULT) %>%
  select(SUBJECT, TEST, `D-1`:`D2 24h`, UNITS, RANGES)
# SUBJECT TEST D-1 D1 8h D2 24h UNITS RANGES
#1 5 Erythrocyte count urine 0 0 0 /?L |-< 19|-
#2 5 Glucose dipstick urine Normal Normal Normal None |+ from 50 mg/dL-|-
#3 5 Leukoyte count urine 1 0 1 /?L |-< 15|-
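In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`, so the same reshape can also be written as below. This is only a sketch under assumptions: the data are rebuilt by hand as character columns rather than taken from the dput, and the name `wide` is mine. Note that `pivot_wider()` keeps rows in order of first appearance, so the row order may differ from the `spread()` output above.

```r
library(tidyr)

# Hand-built copy of the nine rows shown in the question (an assumption:
# plain character columns instead of the factors in the dput).
dataset <- data.frame(
  SUBJECT = 5L,
  TIME    = rep(c("D-1", "D1 8h", "D2 24h"), each = 3),
  TEST    = rep(c("Leukoyte count urine", "Erythrocyte count urine",
                  "Glucose dipstick urine"), times = 3),
  RESULT  = c("1", "0", "Normal", "0", "0", "Normal", "1", "0", "Normal"),
  UNITS   = rep(c("/?L", "/?L", "None"), times = 3),
  RANGES  = rep(c("|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-"), times = 3),
  stringsAsFactors = FALSE
)

# Columns not named in names_from / values_from identify the output rows.
wide <- pivot_wider(dataset, names_from = TIME, values_from = RESULT)

# Put the time points between TEST and UNITS, as asked in the question.
wide <- wide[, c("SUBJECT", "TEST", "D-1", "D1 8h", "D2 24h", "UNITS", "RANGES")]
print(wide)
```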
Answer 1 (score: 0)
I believe you are asking for a simple, direct use of dcast(), which reshapes data from long to wide. Here is an implementation using the data.table package.
library(data.table)
#> Warning: package 'data.table' was built under R version 3.4.4
x <- structure(list(SUBJECT = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), TIME = c("D-1", "D-1", "D-1", "D1 8h", "D1 8h", "D1 8h", "D2 24h",
"D2 24h", "D2 24h"), TEST = c("Leukoyte count urine", "Erythrocyte count urine",
"Glucose dipstick urine", "Leukoyte count urine", "Erythrocyte count urine",
"Glucose dipstick urine", "Leukoyte count urine", "Erythrocyte count urine",
"Glucose dipstick urine"), RESULT = c("1", "0", "Normal", "0",
"0", "Normal", "1", "0", "Normal"), UNITS = c("/?L", "/?L", "None",
"/?L", "/?L", "None", "/?L", "/?L", "None"), RANGES = c("|-< 15|-",
"|-< 19|-", "|+ from 50 mg/dL-|-", "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-",
"|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-")), .Names = c("SUBJECT",
"TIME", "TEST", "RESULT", "UNITS", "RANGES"), row.names = c(NA,
-9L), class = c("data.table", "data.frame"))
dcast(SUBJECT + TEST ~ TIME, data = x, value.var = c("UNITS", "RANGES"))
#> SUBJECT TEST UNITS_D-1 UNITS_D1 8h UNITS_D2 24h
#> 1: 5 Erythrocyte count urine /?L /?L /?L
#> 2: 5 Glucose dipstick urine None None None
#> 3: 5 Leukoyte count urine /?L /?L /?L
#> RANGES_D-1 RANGES_D1 8h RANGES_D2 24h
#> 1: |-< 19|- |-< 19|- |-< 19|-
#> 2: |+ from 50 mg/dL-|- |+ from 50 mg/dL-|- |+ from 50 mg/dL-|-
#> 3: |-< 15|- |-< 15|- |-< 15|-
Created on 2019-02-23 by the reprex package (v0.2.1)
Maybe this is what you want (if not, please add the expected output to the question so that nobody has to guess):
dcast(SUBJECT + TEST + UNITS + RANGES ~ TIME, data = df, value.var = "RESULT")
SUBJECT TEST UNITS RANGES D-1 D1 8h D2 24h
1 5 Erythrocyte count urine /?L |-< 19|- 0 0 0
2 5 Glucose dipstick urine None |+ from 50 mg/dL-|- Normal Normal Normal
3 5 Leukoyte count urine /?L |-< 15|- 1 0 1
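The call above refers to `df`, which is not defined in the snippet. A self-contained sketch of the same idea follows, assuming `df` holds the nine rows from the question; the hand-built table `dt`, the name `wide` and the `setcolorder()` step are my additions for illustration.

```r
library(data.table)

# Hand-built copy of the nine rows shown in the question (an assumption:
# character columns, stored directly as a data.table).
dt <- data.table(
  SUBJECT = 5L,
  TIME    = rep(c("D-1", "D1 8h", "D2 24h"), each = 3),
  TEST    = rep(c("Leukoyte count urine", "Erythrocyte count urine",
                  "Glucose dipstick urine"), times = 3),
  RESULT  = c("1", "0", "Normal", "0", "0", "Normal", "1", "0", "Normal"),
  UNITS   = rep(c("/?L", "/?L", "None"), times = 3),
  RANGES  = rep(c("|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-"), times = 3)
)

# Left of ~ : columns that identify a row; right of ~ : column whose values
# become the new column names; value.var : column that fills the cells.
wide <- dcast(dt, SUBJECT + TEST + UNITS + RANGES ~ TIME, value.var = "RESULT")

# Move the time points between TEST and UNITS (modifies wide by reference).
setcolorder(wide, c("SUBJECT", "TEST", "D-1", "D1 8h", "D2 24h", "UNITS", "RANGES"))
print(wide)
```

Keeping UNITS and RANGES on the left-hand side of the formula carries them through unchanged, since they are constant within each test.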