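Sorry if this example is too large. It does look more realistic, but I had trouble coming up with an example that explains my situation better.
What I want is a tidy data.frame in which I can use the medical conditions in summaries (avg) and plots. (Edit) The question I need answered is whether I am going about this the right way. Do I want one row with a huge string whose values are separated by commas? Do I need to split it into more columns?
The reports come from our database vendor (the actual data has been changed). The reports do not provide a unique key. In my data.frames, person.id is unique in some, while others are like this one, with multiple rows per person.id and value.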
person.id <- c("1017", "1018", "1018", "1018", "1018", "1018", "1018",
"1018", "1018", "1018", "1018", "1019", "1019", "1020",
"1020")
med.condition <- c(NA, "Allergic rhinitis", "Allergic rhinitis",
"Atopic Dermatitis", "Atopic Dermatitis",
"Developmental Speech",
"Developmental Speech",
"Eye Condition", "Eye Condition", "Speech Delay",
"Speech Delay", "Allergic Reaction", NA, "Eczema",
"Obese")
cond.type <- c("Assessment", "Assessment", NA, "Assessment", NA, "Assessment",
NA, "Assessment", NA, "Assessment", NA, "Assessment",
"Assessment", "Assessment", "Assessment")
df <- data.frame(person.id, med.condition, cond.type)
which looks like:
person.id med.condition cond.type
1 1017 NA Assessment
2 1018 Allergic rhinitis Assessment
3 1018 Allergic rhinitis NA
4 1018 Atopic Dermatitis Assessment
5 1018 Atopic Dermatitis NA
6 1018 Developmental Speech Assessment
7 1018 Developmental Speech NA
8 1018 Eye Condition Assessment
9 1018 Eye Condition NA
10 1018 Speech Delay Assessment
11 1018 Speech Delay NA
12 1019 Allergic Reaction Assessment
13 1019 NA Assessment
14 1020 Eczema Assessment
15 1020 Obese Assessment
I want one row per person.id.
Do I want it to look like this (only the first 5 columns shown)? Using tapply fails at being tidy:
condition1 condition2 condition3 condition4 condition5
1017 NA NA NA NA NA
1018 Allergic rhinitis Atopic Dermatitis Allergic Reaction Developmental Speech Eye Condition
1019 NA NA NA NA NA
1020 Eczema Obese NA NA NA
How do I make the data set tidy?
med.condition
1017 NA
1018 "Allergic rhinitis", "Atopic Dermatitis", "Developmental Speech", "Eye Condition", "Speech Delay", "Allergic Reaction"
1019 NA
1020 "Eczema" "Obese"
Or do I need to think about this in a new way?
What I have tried: tapply, reshape2
tapply: this does not work on this example, but it does in my program, sorry
df2 <- data.frame(person.id, med.condition, cond.type)
df2.wide <- tapply(X = df2$medical.condition, INDEX = df2$person.id,
function(x){
ux <- unique(x)
c(ux, rep(x = NA, 9 - length (ux)))
})
df2.wide <- as.data.frame(do.call('rbind', df2.wide), stringsAsFactors = FALSE)
names(promis.b.temp) <- paste0('condition', 1:9)
cols <- names(promis.b.temp)
df2$med.all <- apply(df2[, cols], 1, paste, collapse = ",")
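For readers trying the snippet on the example data: it appears to fail because df2 has no column named medical.condition (the column is med.condition) and promis.b.temp is never defined in this example. A minimal sketch with those names adapted follows. The compact retyping of the data, stringsAsFactors = FALSE, and the cap of 9 conditions per person are assumptions for illustration, not part of the original program.

```r
# Example data retyped compactly from the question (cond.type is not needed here)
person.id <- c("1017", rep("1018", 10), "1019", "1019", "1020", "1020")
med.condition <- c(NA,
                   rep(c("Allergic rhinitis", "Atopic Dermatitis",
                         "Developmental Speech", "Eye Condition",
                         "Speech Delay"), each = 2),
                   "Allergic Reaction", NA, "Eczema", "Obese")
df2 <- data.frame(person.id, med.condition, stringsAsFactors = FALSE)

# One list element per person: unique conditions, padded with NA to length 9
per.person <- tapply(df2$med.condition, df2$person.id, function(x) {
  ux <- unique(x)
  c(ux, rep(NA, 9 - length(ux)))
}, simplify = FALSE)

# Stack the padded vectors into one row per person
df2.wide <- as.data.frame(do.call(rbind, per.person), stringsAsFactors = FALSE)
names(df2.wide) <- paste0("condition", 1:9)
print(df2.wide[, 1:5])
```

This only shows that the tapply idea can run on the example; whether a wide layout is the right target is a separate question.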
reshape2: I quickly realized this would not work
library(reshape2)
df3 <- test %>% melt() %>% unique() %>% cast(person.id)
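Answer 0 (score: 2)
I don't really understand the question. Your data already appear to be "tidy".
The two things I notice are (1) duplicated values (which may or may not be wanted) and (2) the lack of a unique ID for each person and medical condition.
If you want a long comma-separated string (which in my opinion is hard to work with), you can aggregate over the unique values in the first two columns, like this: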
library(data.table)
as.data.table(unique(df[1:2]))[, paste(med.condition, collapse = ","), by = person.id]
# person.id V1
# 1: 1017 NA
# 2: 1018 Allergic rhinitis,Atopic Dermatitis,Developmental Speech,Eye Condition,Speech Delay
# 3: 1019 Allergic Reaction,NA
# 4: 1020 Eczema,Obese
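If data.table is not available, roughly the same collapse can be sketched in base R. This is a stand-in, not the approach used above, and it differs in one respect: it drops NA before pasting, which is only an assumption about what is wanted. The data is retyped compactly from the question, and the column name med.all is borrowed from the question's own attempt.

```r
# Example data retyped compactly from the question
person.id <- c("1017", rep("1018", 10), "1019", "1019", "1020", "1020")
med.condition <- c(NA,
                   rep(c("Allergic rhinitis", "Atopic Dermatitis",
                         "Developmental Speech", "Eye Condition",
                         "Speech Delay"), each = 2),
                   "Allergic Reaction", NA, "Eczema", "Obese")
df <- data.frame(person.id, med.condition, stringsAsFactors = FALSE)

# Drop duplicated person/condition pairs first
u <- unique(df[, c("person.id", "med.condition")])

# Paste the non-missing conditions of each person into one string
collapsed <- tapply(u$med.condition, u$person.id,
                    function(x) paste(x[!is.na(x)], collapse = ","))

result <- data.frame(person.id = names(collapsed),
                     med.all = as.vector(collapsed),
                     stringsAsFactors = FALSE)
print(result)
```

With this variant, a person whose only entry is NA ends up with an empty string rather than the text "NA".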
If you want to easily get a sequential ID for each person, you can use getanID from my "splitstackshape" package:
library(splitstackshape)
getanID(as.data.table(unique(df[1:2])), "person.id")
If needed, you can then convert to wide format with dcast.data.table, like this:
library(splitstackshape)
dcast.data.table(getanID(as.data.table(unique(df[1:2])), "person.id"),
person.id ~ .id, value.var = "med.condition",
fun.aggregate = function(x) paste(x, collapse = ","))
# person.id 1 2 3 4 5
# 1: 1017 NA
# 2: 1018 Allergic rhinitis Atopic Dermatitis Developmental Speech Eye Condition Speech Delay
# 3: 1019 Allergic Reaction NA
# 4: 1020 Eczema Obese
Answer 1 (score: 1)
You can do this with the base reshape() function if you just add a "time" indicator to each observation, which is easy to do with ave(). If you run
reshape(
transform(
unique(df[, c("person.id","med.condition")]),
time=ave(as.numeric(person.id), person.id, FUN=seq_along)
),
idvar="person.id",
v.names="med.condition",
direction="wide")
you get
person.id med.condition.1 med.condition.2 med.condition.3 med.condition.4 med.condition.5
1017 NA NA NA NA NA
1018 Allergic rhinitis Atopic Dermatitis Developmental Speech Eye Condition Speech Delay
1019 Allergic Reaction NA NA NA NA
1020 Eczema Obese NA NA NA
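To see what the ave() call contributes, here is a small sketch that builds only the "time" column on the de-duplicated data. The data is retyped compactly from the question, and stringsAsFactors = FALSE is an assumption so that as.numeric() sees the ID strings rather than factor codes.

```r
# Example data retyped compactly from the question
person.id <- c("1017", rep("1018", 10), "1019", "1019", "1020", "1020")
med.condition <- c(NA,
                   rep(c("Allergic rhinitis", "Atopic Dermatitis",
                         "Developmental Speech", "Eye Condition",
                         "Speech Delay"), each = 2),
                   "Allergic Reaction", NA, "Eczema", "Obese")
df <- data.frame(person.id, med.condition, stringsAsFactors = FALSE)

# Keep one row per person/condition pair
u <- unique(df[, c("person.id", "med.condition")])

# Number the rows within each person.id: 1, 2, 3, ...
u$time <- ave(as.numeric(u$person.id), u$person.id, FUN = seq_along)
print(u)
```

reshape() looks for a column named "time" by default and spreads med.condition across columns by that counter, which is where the suffixes in med.condition.1, med.condition.2 and so on come from.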
Answer 2 (score: 1)
Your data frame is in "long" format and you want to reshape it to "wide" format. Try the following:
require(reshape2)
df.new <- reshape(df,idvar='person.id',timevar='cond.type',direction='wide')