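I have a large data.frame in which the first three columns contain information about a marker. The remaining columns hold numeric values for that marker in each individual, with three columns per individual. The data set looks like this: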
marker alleleA alleleB X818 X818.1 X818.2 X345 X345.1 X345.2 X346 X346.1 X346.2
1 kgp5209280_chr3_21902067 T A 0.0000 1.0000 0.0000 1.0000 0.0000 0.0000 0.0000 1.0000 0.0000
2 chr3_21902130_21902131_A_T A T 0.8626 0.1356 0.0018 0.7676 0.2170 0.0154 0.8626 0.1356 0.0018
3 chr3_21902134_21902135_T_C T C 0.6982 0.2854 0.0164 0.5617 0.3749 0.0634 0.6982 0.2854 0.0164
That is, for each marker (row), every individual has three values, one per column.
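I want to create a new data.frame with the same rows as the original, but with only one column per individual. In each individual's column I want whichever of that individual's three values is greater than 0.8. If none of the values is greater than 0.8, I want NA. For example, in the first row of the data set above I want the second value for 818 (1.0000) and the first value for 345 (1.0000). In the second row I want the first value for 818 (0.8626), while for 345 no value is above 0.8, so I want NA, and so on. The new data set would therefore look like this: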
marker alleleA alleleB X818 X345
1 kgp5209280_chr3_21902067 T A 1.0000 1
2 chr3_21902130_21902131_A_T A T 0.8626 NA
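I have been trying to use an if/else statement, along the lines of: if [, 4] > 0.8 then [, 4], else...
However, it does not seem to give me what I want, and I would also have to loop over this command so that it runs not just for the first individual's three columns but for all columns.
Any help would be greatly appreciated! Thanks in advance.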
Answer 0 (score: 14)
Updated solution using the fast melt/dcast methods implemented in data.table versions >= 1.9.0. Go here for more information.
require(data.table)
require(reshape2)
dt <- as.data.table(df)
# melt data.table
dt.m <- melt(dt, id=c("marker", "alleleA", "alleleB"),
variable.name="id", value.name="val")
dt.m[, id := gsub("\\.[0-9]+$", "", id)] # replace `.[0-9]` with nothing
# aggregation
dt.m <- dt.m[, list(alleleA = alleleA[1],
alleleB = alleleB[1], val = max(val)),
keyby=list(marker, id)][val <= 0.8, val := NA]
# casting back
dt.c <- dcast.data.table(dt.m, marker + alleleA + alleleB ~ id)
# marker alleleA alleleB X345 X346 X818
# 1: chr3_21902130_21902131_A_T A T NA 0.8626 0.8626
# 2: chr3_21902134_21902135_T_C T C NA NA NA
# 3: kgp5209280_chr3_21902067 T A 1 1.0000 1.0000
Solution 1: Probably not the best way, but this is what I can think of right now:
mm <- t(apply(df[-(1:3)], 1, function(x) tapply(x, gl(3,3), max)))
mode(mm) <- "numeric"
mm[mm <= 0.8] <- NA # keep only values strictly greater than 0.8
# you can set the column names of mm here if necessary
out <- cbind(df[, 1:3], mm)
# marker alleleA alleleB 1 2 3
# 1 kgp5209280_chr3_21902067 T A 1.0000 1 1.0000
# 2 chr3_21902130_21902131_A_T A T 0.8626 NA 0.8626
# 3 chr3_21902134_21902135_T_C T C NA NA NA
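gl(3,3) gives a factor with values 1,1,1,2,2,2,3,3,3 and levels 1,2,3. That is, tapply takes the values of x three at a time and gets their max (the first 3, the next 3 and the last 3). apply passes in each row one by one.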
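To make the grouping step concrete, here is a minimal sketch of what gl and tapply do for a single row. The vector x is copied from the second row of the question's data; the names groups, groups_int and row_max are only illustrative and are not part of the solution above.

```r
# One row's worth of values: three individuals, three values each
x <- c(0.8626, 0.1356, 0.0018,   # columns X818, X818.1, X818.2
       0.7676, 0.2170, 0.0154,   # columns X345, X345.1, X345.2
       0.8626, 0.1356, 0.0018)   # columns X346, X346.1, X346.2

# gl(3, 3) builds a factor with 3 levels, each repeated 3 times
groups <- gl(3, 3)
groups_int <- as.integer(groups)   # the group index of each value

# tapply() splits x by group and applies max() to each piece
row_max <- tapply(x, groups, max)

# Values that do not exceed the threshold become NA
row_max[row_max <= 0.8] <- NA
print(row_max)
```

If the number of individuals is not known in advance, something like gl(length(x) / 3, 3) should generalise the grouping, assuming every individual really has exactly three adjacent columns.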
Solution 2: A data.table solution that does the melt and cast steps within data.table itself, without using reshape or reshape2:
require(data.table)
dt <- data.table(df)
# melt your data.table to long format
dt.melt <- dt[, list(id = names(.SD), val = unlist(.SD)),
by=list(marker, alleleA, alleleB)]
# replace `.[0-9]` with nothing
dt.melt[, id := gsub("\\.[0-9]+$", "", id)]
# get max value grouping by marker and id
dt.melt <- dt.melt[, list(alleleA = alleleA[1],
alleleB = alleleB[1],
val = max(val)),
keyby=list(marker, id)][val <= 0.8, val := NA]
# edit by mnel: use setattr(, 'names') to avoid the copy made by `names<-` within `setNames`
dt.cast <- dt.melt[, as.list(setattr(val,'names', id)),
by=list(marker, alleleA, alleleB)]
# marker alleleA alleleB X345 X346 X818
# 1: chr3_21902130_21902131_A_T A T NA 0.8626 0.8626
# 2: chr3_21902134_21902135_T_C T C NA NA NA
# 3: kgp5209280_chr3_21902067 T A 1 1.0000 1.0000
Answer 1 (score: 3)
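I think it is better to put the data in long format. Here is a solution based on the reshape2 package, probably similar to @Arun's second solution but with a different syntax: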
library(reshape2)
dat.m <- melt(dat,id.vars=1:3)
dat.m$variable <- gsub('[.].*','',dat.m$variable)
dcast(dat.m,...~variable,fun.aggregate=function(x){
res <- NA_real_
if(length(x) > 0 && max(x)> 0.8)
res <- max(x)
res
})
marker alleleA alleleB X345 X346 X818
1 chr3_21902130_21902131_A_T A T NA 0.8626 0.8626
2 chr3_21902134_21902135_T_C T C NA NA NA
3 kgp5209280_chr3_21902067 T A 1 1.0000 1.0000
Answer 2 (score: 1)
Here is my approach using the function pmax. Note that if an individual has two or more values above 0.8, this gives the largest of them:
df <- read.table(textConnection(" marker alleleA alleleB X818 X818.1 X818.2 X345 X345.1 X345.2 X346 X346.1 X346.2
1 kgp5209280_chr3_21902067 T A 0.0000 1.0000 0.0000 1.0000 0.0000 0.0000 0.0000 1.0000 0.0000
2 chr3_21902130_21902131_A_T A T 0.8626 0.1356 0.0018 0.7676 0.2170 0.0154 0.8626 0.1356 0.0018
3 chr3_21902134_21902135_T_C T C 0.6982 0.2854 0.0164 0.5617 0.3749 0.0634 0.6982 0.2854 0.0164"), header=TRUE)
#data.table solution
library(data.table)
DT <- as.data.table(df)
DT[, M818 := ifelse(pmax(X818, X818.1, X818.2) > 0.8, pmax(X818, X818.1, X818.2), NA)]
DT[, M345 := ifelse(pmax(X345, X345.1, X345.2) > 0.8, pmax(X345, X345.1, X345.2), NA)]
DT[, M346 := ifelse(pmax(X346, X346.1, X346.2) > 0.8, pmax(X346, X346.1, X346.2), NA)]
#Base R solution
df$M818 <- ifelse(pmax(df$X818, df$X818.1, df$X818.2) > 0.8, pmax(df$X818, df$X818.1, df$X818.2), NA)
df$M345 <- ifelse(pmax(df$X345, df$X345.1, df$X345.2) > 0.8, pmax(df$X345, df$X345.1, df$X345.2), NA)
df$M346 <- ifelse(pmax(df$X346, df$X346.1, df$X346.2) > 0.8, pmax(df$X346, df$X346.1, df$X346.2), NA)
If you want to get rid of the other columns, simply type:
DT[, list(marker, alleleA, alleleB, M818, M345, M346)]
marker alleleA alleleB M818 M345 M346
1: kgp5209280_chr3_21902067 T A 1.0000 1 1.0000
2: chr3_21902130_21902131_A_T A T 0.8626 NA 0.8626
3: chr3_21902134_21902135_T_C T C NA NA NA
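The question also asks how to avoid writing one line per individual. As a rough base R sketch of the same pmax idea, the columns can be grouped by name and pmax called once per individual. This assumes that an individual's columns share a prefix and differ only by a numeric suffix such as .1 or .2; the names value_cols, ind, best and out are illustrative only.

```r
df <- data.frame(
  marker  = c("kgp5209280_chr3_21902067", "chr3_21902130_21902131_A_T",
              "chr3_21902134_21902135_T_C"),
  alleleA = c("T", "A", "T"),
  alleleB = c("A", "T", "C"),
  X818 = c(0.0000, 0.8626, 0.6982), X818.1 = c(1.0000, 0.1356, 0.2854),
  X818.2 = c(0.0000, 0.0018, 0.0164),
  X345 = c(1.0000, 0.7676, 0.5617), X345.1 = c(0.0000, 0.2170, 0.3749),
  X345.2 = c(0.0000, 0.0154, 0.0634),
  X346 = c(0.0000, 0.8626, 0.6982), X346.1 = c(1.0000, 0.1356, 0.2854),
  X346.2 = c(0.0000, 0.0018, 0.0164),
  stringsAsFactors = FALSE
)

value_cols <- names(df)[-(1:3)]
# Strip the ".1" / ".2" suffix to recover the individual behind each column
ind <- sub("\\.[0-9]+$", "", value_cols)

# Group the column names by individual, keeping the original column order
col_groups <- split(value_cols, factor(ind, levels = unique(ind)))

# For each individual, take the row-wise maximum and blank out values <= 0.8
best <- lapply(col_groups, function(cols) {
  m <- do.call(pmax, unname(as.list(df[cols])))
  ifelse(m > 0.8, m, NA_real_)
})

out <- cbind(df[1:3], as.data.frame(best))
print(out)
```

Because the grouping is derived from the column names, the same code should work for any number of individuals, as long as the naming assumption holds.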
Answer 3 (score: 0)
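Here is another possible solution. All of the solutions above work.
My solution is to create a function for your specific case without using any additional libraries. It is long and could be condensed, but seeing each step is useful for understanding how the function works.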
olddf <- data.frame(marker = c("kgp5209280_chr3_21902067",
"chr3_21902130_21902131_A_T",
"chr3_21902134_21902135_T_C"),
alleleA = c("T","A","T"),
alleleB = c("A","T","C"),
X818 = c(0.0000,0.8626,0.6982),
X818.1 = c(1.0000,0.1356,0.2854),
X818.2 = c(0.0000,0.0018,0.0164),
X345 = c(1.0000,0.7676, 0.5617),
X345.1 = c(0.0000, 0.2170, 0.3749),
X345.2 = c(0.0000, 0.0154, 0.0634),
X346 = c(0.0000, 0.8626, 0.6982),
X346.1 = c(1.0000,0.1356, 0.2854),
X346.2 = c(0.0000, 0.0018, 0.0164))
mergeallele <- function(arguments,threshold = 0.8){
n <- nrow(arguments)
# Creation of a results object as an empty list of length NROW
# speed for huge data.frame
new.lst <- vector(mode="list", n)
for (i in 1:n){
marker_row <- arguments[i,]
colvalue.4 <- NA_real_
# keep the largest of the three values only if it exceeds the threshold
if (max(marker_row[,c(4:6)]) > threshold){
colvalue.4 <- max(marker_row[,c(4:6)])
}
colvalue.5 <- NA_real_
if (max(marker_row[,c(7:9)]) > threshold){
colvalue.5 <- max(marker_row[,c(7:9)])
}
colvalue.6 <- NA_real_
if (max(marker_row[,c(10:12)]) > threshold){
colvalue.6 <- max(marker_row[,c(10:12)])
}
new.lst[[i]] <- data.frame(marker_row[,1],
marker_row[,2],
marker_row[,3],
colvalue.4,
colvalue.5,
colvalue.6)
}
new.df <- as.data.frame(do.call("rbind",new.lst))
names(new.df) <- c(colnames(arguments)[1],
colnames(arguments)[2],
colnames(arguments)[3],
colnames(arguments)[4],
colnames(arguments)[7],
colnames(arguments)[10])
return(new.df)
}
newdf <- mergeallele(olddf)
marker alleleA alleleB X818 X345 X346
1 kgp5209280_chr3_21902067 T A 1.0000 1 1.0000
2 chr3_21902130_21902131_A_T A T 0.8626 NA 0.8626
3 chr3_21902134_21902135_T_C T C NA NA NA
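Notes:
threshold = 0.8
You can pass your own threshold (0.8 is the default), which avoids changing a variable inside the function.
new.lst <- vector(mode="list", n)
This creates an empty list with one element per row of the old data.frame; the loop then fills in the elements one by one (faster). See this Blog for the speed tests.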
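As a small illustration of the preallocation point, the sketch below builds the same list twice, once by growing it and once by filling preallocated slots. The names n, grown and prealloc are made up for the example, and no timings are claimed here; the actual difference depends on the R version and the size of the data.

```r
n <- 1000

# Growing a list: one element is appended on every iteration
grown <- list()
for (i in seq_len(n)) {
  grown[[length(grown) + 1]] <- i^2
}

# Preallocating: all n slots are created once, then filled in place
prealloc <- vector(mode = "list", length = n)
for (i in seq_len(n)) {
  prealloc[[i]] <- i^2
}

# Both approaches produce the same result
same_result <- identical(grown, prealloc)
print(same_result)
```

Wrapping each loop in system.time() is one way to compare the two approaches on your own data.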