How to create tidy data from a dataset where values repeat across multiple rows

Time: 2014-12-05 22:40:03

Tags: r

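Sorry if this example is too large. It looks more realistic this way, and I had trouble coming up with a smaller one that explains my situation better.

What I want is a tidy data.frame of medical conditions that I can use in summaries (avg) and plots. (Edit) The question I need answered: am I going about this the right way? Do I want one row with a huge string whose values are separated by commas? Do I need to split it into more columns?

The data come from our database vendor's reports (the actual data have been changed). The reports do not provide a unique key. In some of my data.frames person.id is unique; in others, like this one, there are multiple rows per person.id with different values.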

person.id <- c("1017", "1018", "1018", "1018", "1018", "1018", "1018",
               "1018", "1018", "1018", "1018", "1019", "1019", "1020",
               "1020")
med.condition <- c(NA, "Allergic rhinitis", "Allergic rhinitis",
                   "Atopic Dermatitis", "Atopic Dermatitis",
                   "Developmental Speech",
                   "Developmental Speech",
                   "Eye Condition", "Eye Condition", "Speech Delay",
                   "Speech Delay", "Allergic Reaction", NA, "Eczema",
                   "Obese")
cond.type <- c("Assessment", "Assessment", NA, "Assessment", NA, "Assessment",
               NA, "Assessment", NA, "Assessment", NA, "Assessment",
               "Assessment", "Assessment", "Assessment")
df <- data.frame(person.id, med.condition, cond.type)

It looks like:

  person.id  med.condition                              cond.type
1   1017    NA                                          Assessment
2   1018    Allergic rhinitis                           Assessment
3   1018    Allergic rhinitis                           NA
4   1018    Atopic Dermatitis                           Assessment
5   1018    Atopic Dermatitis                           NA
6   1018    Developmental Speech                        Assessment
7   1018    Developmental Speech                        NA
8   1018    Eye Condition                               Assessment
9   1018    Eye Condition                               NA
10  1018    Speech Delay                                Assessment
11  1018    Speech Delay                                NA
12  1019    Allergic Reaction                           Assessment
13  1019    NA                                          Assessment
14  1020    Eczema                                      Assessment
15  1020    Obese                                       Assessment

I want one row per person.id.

Do I want it to look like this (only the first 5 columns shown)? I used tapply for this, which fails at making it tidy.

    condition1         condition2        condition3        condition4           condition5
1017 NA                 NA                NA                NA                   NA
1018 Allergic rhinitis Atopic Dermatitis  Allergic Reaction Developmental Speech Eye Condition
1019 NA                 NA                NA                NA                   NA
1020 Eczema             Obese             NA                NA                   NA

How do I make the dataset tidy?

     med.condition
1017 NA
1018 "Allergic rhinitis", "Atopic Dermatitis", "Developmental Speech", "Eye Condition", "Speech Delay", "Allergic Reaction" 
1019 NA
1020 "Eczema" "Obese"

Or do I need to think about this in a new way?

What I have tried: tapply, reshape2

tapply does not work on this example, but it did in my program, sorry.

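df2 <- data.frame(person.id, med.condition, cond.type)
# One row per person: unique conditions, padded with NA up to 9 columns.
# as.character() guards against med.condition being stored as a factor.
df2.wide <- tapply(X = as.character(df2$med.condition), INDEX = df2$person.id,
                        function(x){
                          ux <- unique(x)
                          c(ux, rep(x = NA, 9 - length (ux)))
                        })
df2.wide <- as.data.frame(do.call('rbind', df2.wide), stringsAsFactors = FALSE)
names(df2.wide) <- paste0('condition', 1:9)

# Collapse the condition columns into one comma-separated string per person.
cols <- names(df2.wide)
df2.wide$med.all <- apply(df2.wide[, cols], 1, paste, collapse = ",")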

reshape2: I quickly realized this was not going to work

library(reshape2)
df3 <- test %>%
  melt() %>%
  unique() %>%
  cast(person.id)

  • Am I approaching this problem correctly?
  • Will I run into problems when I use string filters for reporting?

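3 Answers:

Answer 0 (score: 2)

I don't really understand the question. Your data already seem to be "tidy".

The two things I notice are (1) duplicated values (which you may or may not want) and (2) the lack of a unique ID for each combination of person and medical condition.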
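As a rough illustration of that point, here is a minimal base R sketch, assuming only the person.id and med.condition columns from the question matter (cond.type is left out). It drops the duplicated pairs and summarises straight from the long format; the names long and n.cond are made up for this example.

```r
# Rebuild the question's example data, without cond.type.
person.id <- c("1017", rep("1018", 10), "1019", "1019", "1020", "1020")
med.condition <- c(NA,
                   rep(c("Allergic rhinitis", "Atopic Dermatitis",
                         "Developmental Speech", "Eye Condition",
                         "Speech Delay"), each = 2),
                   "Allergic Reaction", NA, "Eczema", "Obese")
df <- data.frame(person.id, med.condition, stringsAsFactors = FALSE)

# Drop duplicated person/condition pairs; still one row per observation.
long <- unique(df)

# Count the non-missing conditions recorded for each person.
n.cond <- tapply(!is.na(long$med.condition), long$person.id, sum)
n.cond

# Average number of conditions per person.
mean(n.cond)
```

Summaries like this need no reshaping at all, which is one reason to leave the data long.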

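If you want one long comma-separated string (which, in my opinion, is hard to work with), you can aggregate by the unique values in the first two columns, like this: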

library(data.table)
as.data.table(unique(df[1:2]))[, paste(med.condition, collapse = ","), by = person.id]
#    person.id                                                                                  V1
# 1:      1017                                                                                  NA
# 2:      1018 Allergic rhinitis,Atopic Dermatitis,Developmental Speech,Eye Condition,Speech Delay
# 3:      1019                                                                Allergic Reaction,NA
# 4:      1020                                                                        Eczema,Obese
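To make the "hard to work with" remark concrete, here is a small base R sketch comparing a filter on the long format with the same filter on comma-separated strings. It uses the de-duplicated pairs from the question with the NA rows left out for brevity; long, wide.str and the ids.* names are hypothetical.

```r
# De-duplicated person/condition pairs from the question (NA rows omitted).
long <- data.frame(
  person.id = c(rep("1018", 5), "1019", "1020", "1020"),
  med.condition = c("Allergic rhinitis", "Atopic Dermatitis",
                    "Developmental Speech", "Eye Condition", "Speech Delay",
                    "Allergic Reaction", "Eczema", "Obese"),
  stringsAsFactors = FALSE
)

# Long format: an exact match is a plain comparison.
ids.long <- unique(long$person.id[long$med.condition == "Allergic Reaction"])

# Comma-separated format: one string per person.
wide.str <- sapply(split(long$med.condition, long$person.id),
                   paste, collapse = ",")

# A loose pattern match on the string also catches "Allergic rhinitis".
ids.grepl <- names(wide.str)[grepl("Allergic", wide.str)]

# An exact match needs the string split back into its pieces first.
has.exact <- sapply(strsplit(wide.str, ",", fixed = TRUE),
                    function(x) "Allergic Reaction" %in% x)
ids.split <- names(wide.str)[has.exact]
```

Here ids.long and ids.split pick out only person 1019, while the pattern match in ids.grepl also returns 1018. That is the kind of surprise a string filter can produce in a report.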

If you want an easy way to get a sequential ID for each person, you can use getanID from my "splitstackshape" package:

library(splitstackshape)
getanID(as.data.table(unique(df[1:2])), "person.id")

If needed, you can then convert to wide format with dcast.data.table, like this:

library(splitstackshape)
dcast.data.table(getanID(as.data.table(unique(df[1:2])), "person.id"), 
                 person.id ~ .id, value.var = "med.condition", 
                 fun.aggregate = function(x) paste(x, collapse = ","))
#    person.id                 1                 2                    3             4            5
# 1:      1017                NA                                                                  
# 2:      1018 Allergic rhinitis Atopic Dermatitis Developmental Speech Eye Condition Speech Delay
# 3:      1019 Allergic Reaction                NA                                                
# 4:      1020            Eczema             Obese                                                

Answer 1 (score: 1)

If you just add a "time" indicator for each observation (which is easy to do with ave()), you can do this with the base reshape() function. If you run

reshape(
    transform(
        unique(df[, c("person.id","med.condition")]), 
        time=ave(as.numeric(person.id), person.id, FUN=seq_along)
    ), 
    idvar="person.id", 
    v.names="med.condition",
    direction="wide")

you will get

person.id   med.condition.1 med.condition.2 med.condition.3 med.condition.4 med.condition.5
1017    NA  NA  NA  NA  NA
1018    Allergic rhinitis   Atopic Dermatitis   Developmental Speech    Eye Condition   Speech Delay
1019    Allergic Reaction   NA  NA  NA  NA
1020    Eczema  Obese   NA  NA  NA
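For readers unfamiliar with ave(), here is a minimal sketch of the per-person counter on its own, using a made-up, shortened person.id vector rather than the full data. It is a simplified variant of the call above, not the exact expression.

```r
# A shortened, made-up person.id vector.
person.id <- c("1017", "1018", "1018", "1018", "1019", "1019")

# ave() applies FUN within each person.id group and returns a vector
# aligned with the input, so seq_along numbers the rows per person.
time <- ave(seq_along(person.id), person.id, FUN = seq_along)
time
```

This counter is what reshape() picks up as its default time variable to name the wide columns med.condition.1, med.condition.2, and so on.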

Answer 2 (score: 1)

Your data frame is in "long" format and you want to reshape it into "wide" format. Try the following:

require(reshape2)
df.new <- reshape(df,idvar='person.id',timevar='cond.type',direction='wide')