I have a problem with a data frame. Say I have a data frame in which one column contains values ranging from 0 to 100000. An example:
TCGA.CG.4462
ENSG00000000003 4.7574661
ENSG00000000005 0.0000000
ENSG00000000419 24.1066335
ENSG00000000457 2.7631012
ENSG00000000460 0.8928772
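I want to add a new column holding a probability for each value in that column, assigned according to the following 5 categories:
So, for the example, the values I want in the new column are:
So my data frame would look like this: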
TCGA.CG.4462 Prob
ENSG00000000003 4.7574661 0.4
ENSG00000000005 0.0000000 0.2
ENSG00000000419 24.1066335 0.2
ENSG00000000457 2.7631012 0.4
ENSG00000000460 0.8928772 0.2
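I have tried many different approaches, but none has worked so far. I thought an if condition would be the best way to solve this, but the if statement fails with an error because the condition has length > 1. Can anyone tell me how best to solve this?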
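Answer 0 (score: 1)
We can use cut to find the interval each value falls into and label the intervals with the desired probabilities. Because the probability labels contain duplicates, a warning message is emitted (shown in the output), which can be ignored. See the demo below: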
library(data.table)
cut(df1$TCGA.CG.4462, breaks = c(-Inf, 0, 1, 10, 100, Inf), include.lowest = TRUE)
# [1] (1,10] [-Inf,0] (10,100] (1,10] (0,1]
# Levels: [-Inf,0] (0,1] (1,10] (10,100] (100, Inf]
df1[, prob := as.numeric(as.character(cut(TCGA.CG.4462,
                                          breaks = c(-Inf, 0, 1, 10, 100, Inf),
                                          include.lowest = TRUE,
                                          labels = c(0.2, 0.2, 0.4, 0.2, 0.0))))]
# Warning message:
# In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
# else paste0(labels, : duplicated levels in factors are deprecated
df1
# genes TCGA.CG.4462 prob
# 1: ENSG00000000003 4.7574661 0.4
# 2: ENSG00000000005 0.0000000 0.2
# 3: ENSG00000000419 24.1066335 0.2
# 4: ENSG00000000457 2.7631012 0.4
# 5: ENSG00000000460 0.8928772 0.2
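The duplicated-label warning can probably be avoided altogether by asking cut for integer interval codes (labels = FALSE) and using them to index a plain numeric vector of probabilities. The following is only a minimal sketch of that variant, not part of the original answer; the names breaks, probs and idx are chosen here for illustration, and the data is rebuilt as an ordinary data.frame from the values in the question.

```r
# Example values copied from the question.
df1 <- data.frame(
  genes = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
            "ENSG00000000457", "ENSG00000000460"),
  TCGA.CG.4462 = c(4.7574661, 0, 24.1066335, 2.7631012, 0.8928772),
  stringsAsFactors = FALSE
)

# Same break points and probabilities as in the answer above.
breaks <- c(-Inf, 0, 1, 10, 100, Inf)
probs  <- c(0.2, 0.2, 0.4, 0.2, 0.0)   # one probability per interval

# labels = FALSE makes cut() return the interval number (1..5)
# instead of a factor, so no factor labels are involved at all.
idx <- cut(df1$TCGA.CG.4462, breaks = breaks,
           labels = FALSE, include.lowest = TRUE)

# Look up the probability belonging to each interval number.
df1$prob <- probs[idx]
df1
```

Because the result is already numeric, the as.numeric(as.character(...)) round trip is not needed in this variant.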
Using base R (no packages):
df1 <- within(df1, prob <- as.numeric(as.character(cut(TCGA.CG.4462,
                                                       breaks = c(-Inf, 0, 1, 10, 100, Inf),
                                                       include.lowest = TRUE,
                                                       labels = c(0.2, 0.2, 0.4, 0.2, 0.0)))))
Data:
library(data.table)
df1 <- fread('ENSG00000000003 4.7574661
ENSG00000000005 0.0000000
ENSG00000000419 24.1066335
ENSG00000000457 2.7631012
ENSG00000000460 0.8928772', header = FALSE)
colnames(df1) <- c("genes", "TCGA.CG.4462")
Edit: adding a third column named third that holds the value 1.
With the data.table package:
df1[, `:=`(prob = as.numeric(as.character(cut(TCGA.CG.4462,
                                              breaks = c(-Inf, 0, 1, 10, 100, Inf),
                                              include.lowest = TRUE,
                                              labels = c(0.2, 0.2, 0.4, 0.2, 0.0)))),
           third = 1)]
Base R:
within(df1, {
  # both assignments are evaluated inside df1 and become new columns
  prob <- as.numeric(as.character(cut(TCGA.CG.4462,
                                      breaks = c(-Inf, 0, 1, 10, 100, Inf),
                                      include.lowest = TRUE,
                                      labels = c(0.2, 0.2, 0.4, 0.2, 0.0))))
  third <- 1
})
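On the if error mentioned in the question: if expects a single TRUE or FALSE, so it cannot branch separately for every row of a column, whereas ifelse works element by element. The sketch below is offered only as a comparison with cut, not as part of the original answer. It assumes the same thresholds as the breaks used above (the question's own category definitions were not preserved), and the names x and prob are illustrative.

```r
# Example values copied from the question.
x <- c(4.7574661, 0, 24.1066335, 2.7631012, 0.8928772)

# Nested ifelse(): each test is applied to the whole vector at once.
# Thresholds mirror breaks = c(-Inf, 0, 1, 10, 100, Inf) (assumed).
prob <- ifelse(x <= 0,   0.2,
        ifelse(x <= 1,   0.2,
        ifelse(x <= 10,  0.4,
        ifelse(x <= 100, 0.2,
                         0.0))))
prob
```

With five categories the nesting is still readable, but cut scales better as the number of intervals grows.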
Answer 1 (score: 0)
Here is another data.table solution, which uses a lookup table and an update in a non-equi join:
library(data.table)
# create lookup table
lookup <- data.table(
  expression = c("non", "low", "normal", "high", "very_high"),
  Prob = c(0.2, 0.2, 0.4, 0.2, 0.0),
  lower = c(-Inf, 0, 10^(0:2))
)
lookup[, upper := shift(lower, type = "lead", fill = Inf)][]
#    expression Prob lower upper
# 1:        non  0.2  -Inf     0
# 2:        low  0.2     0     1
# 3:     normal  0.4     1    10
# 4:       high  0.2    10   100
# 5:  very_high  0.0   100   Inf
# update in a non-equi join
# note the left open intervals
setDT(DT)[lookup, on = .(TCGA.CG.4462 > lower, TCGA.CG.4462 <= upper),
          `:=`(expression = expression, Prob = Prob)][]
#             row.id TCGA.CG.4462 expression Prob
# 1: ENSG00000000003    4.7574661     normal  0.4
# 2: ENSG00000000005    0.0000000        non  0.2
# 3: ENSG00000000419   24.1066335       high  0.2
# 4: ENSG00000000457    2.7631012     normal  0.4
# 5: ENSG00000000460    0.8928772        low  0.2
Data:
library(data.table)
DT <- fread(
"row.id TCGA.CG.4462
ENSG00000000003 4.7574661
ENSG00000000005 0.0000000
ENSG00000000419 24.1066335
ENSG00000000457 2.7631012
ENSG00000000460 0.8928772"
)
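For comparison, the lookup-table idea can also be sketched in base R with findInterval, whose left.open = TRUE argument gives the same left-open, right-closed intervals as the non-equi join. This is only an illustrative sketch under that assumption, not the answer's method; idx and result are names made up here, and the lookup table is rebuilt as a plain data.frame.

```r
# Lookup table with the lower bound of each left-open interval,
# mirroring the data.table lookup used in the answer.
lookup <- data.frame(
  expression = c("non", "low", "normal", "high", "very_high"),
  Prob       = c(0.2, 0.2, 0.4, 0.2, 0.0),
  lower      = c(-Inf, 0, 1, 10, 100),
  stringsAsFactors = FALSE
)

# Example values copied from the question.
x <- c(4.7574661, 0, 24.1066335, 2.7631012, 0.8928772)

# For each value, find the lookup row whose interval (lower, next lower]
# contains it; left.open = TRUE makes a value of exactly 0 land in "non".
idx <- findInterval(x, lookup$lower, left.open = TRUE)

result <- data.frame(
  TCGA.CG.4462 = x,
  expression   = lookup$expression[idx],
  Prob         = lookup$Prob[idx],
  stringsAsFactors = FALSE
)
result
```

Keeping the categories in a table rather than in the code means the bounds or probabilities can be changed in one place.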