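I have a file with 36 columns. The columns come in groups of three: the first column of each group holds the TPM value, the second holds the gene symbol, and the third holds the transcript that the TPM was computed for.
This means a gene symbol in the second column can repeat in the following rows, and the number of repeats differs between gene symbols depending on how many transcripts the gene has. I want to run a for loop in R that sums all TPM values for the same gene symbol and moves the result into a new data frame.
My code is: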
for (i in 1:12)
{
for (j in 2:length(df$ref_gene_name.i))
{for (k in 2:length(df$ref_gene_name.i))
{ if (df$ref_gene_name.i[k] == df$ref_gene_name.i[k+1])
{df1$ref_gene_name.i[j] <- df$ref_gene_name.i[k]}
df1$TPM.i[j] <- df$TPM.i[k] + df$TPM.i[k+1]
}
}
}
When I run it, I get the error message: Error in if (df$ref_gene_name.i[k] == df$ref_gene_name.i[k + 1]) { : argument is of length zero. Checking the individual steps for errors:
k=5
df$ref_gene_name.0[k]
df$ref_gene_name.0[k] == df$ref_gene_name.0[k+2]
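seems to work and returns the correct value: FALSE if it is not the same symbol and TRUE if it is.
I am not sure where my mistake is; any help is appreciated.
The data looks like this: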
Answer 0: (score: 0)
How about this:
library(dplyr)
# Example Data (NA to simulate a partial line)
df <- data.frame("TPM"=c(0.005,0.0008,0.075),"GeneName"=c("OCT4","TERT","TERT"),"Transcript"=c("a","a","b"),
"TPM2"=c(0.005,0.0008,NA),"GeneName2"=c("OCT4","TERT",NA),"Transcript2"=c("a","a",NA))
# New, empty data frame; each 3-column block is appended to it below
df2 <- data.frame()
for (i in 1:(ncol(df)/3)){
e <- i*3
s <- e-2
dfn <- df[,s:e]
colnames(dfn) <- c("TPM","GeneName","Transcript")
df2 <- rbind(df2,dfn)
}
# Group by gene name, sum the TPM values within each group, and omit any groups with missing values from incomplete lines.
df2 %>% group_by(GeneName) %>% summarise("sumTPM"=sum(TPM)) %>% na.omit()
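One caveat with the last line: if a gene has a name but one of its TPM values is NA, sum(TPM) returns NA and na.omit() then drops the whole gene. The following is only a sketch of a variant that removes rows without a gene name first and sums with na.rm = TRUE. It reuses the example data above; the names starts, blocks, long and result are made up for illustration.

```r
library(dplyr)

# Same toy layout as above: repeated (TPM, GeneName, Transcript) triples.
df <- data.frame(
  TPM = c(0.005, 0.0008, 0.075), GeneName = c("OCT4", "TERT", "TERT"),
  Transcript = c("a", "a", "b"),
  TPM2 = c(0.005, 0.0008, NA), GeneName2 = c("OCT4", "TERT", NA),
  Transcript2 = c("a", "a", NA),
  stringsAsFactors = FALSE
)

# First column index of every 3-column block.
starts <- seq(1, ncol(df), by = 3)

# Cut out each block and give all blocks the same column names.
blocks <- lapply(starts, function(s) {
  block <- df[, s:(s + 2)]
  colnames(block) <- c("TPM", "GeneName", "Transcript")
  block
})

# Stack the blocks into one long data frame.
long <- bind_rows(blocks)

# Drop rows without a gene name, then sum TPM per gene, ignoring missing TPM values.
result <- long %>%
  filter(!is.na(GeneName)) %>%
  group_by(GeneName) %>%
  summarise(sumTPM = sum(TPM, na.rm = TRUE))
```

Whether a missing TPM should count as zero or should invalidate the gene depends on the data, so treat the na.rm = TRUE choice as an assumption.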
Answer 1: (score: 0)
This may need a little tweaking, but something along these lines should work:
for (i in 0:11)
{
for (j in unique(df[,paste0("ref_gene_name.",i)]))
{
print(sum(df[df[,paste0("ref_gene_name.",i)]==j, paste0("TPM.",i)], na.rm=TRUE))
}
}
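The reason for building the column name with paste0() is that `$` takes the name literally: df$ref_gene_name.i looks for a column called "ref_gene_name.i", finds none, and returns NULL, so the comparison inside if() has length zero. Below is a minimal sketch of that behaviour, using an invented toy data frame; sum_by_gene is a hypothetical helper, not part of the code above.

```r
# Toy data, invented for illustration; the column naming follows the question.
df <- data.frame(
  TPM.0 = c(1, 2, 3),
  ref_gene_name.0 = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

i <- 0
k <- 1

# `$` does not substitute the loop variable: no column is named "ref_gene_name.i".
missing_col <- df$ref_gene_name.i
is.null(missing_col)

# Comparing elements of NULL gives a zero-length logical,
# which is what if() rejects with "argument is of length zero".
cmp <- missing_col[k] == missing_col[k + 1]
length(cmp)

# Hypothetical helper: per-gene TPM sums for sample i, returned as a named vector.
# The column names are built as strings and looked up with [[ ]].
sum_by_gene <- function(df, i) {
  tapply(df[[paste0("TPM.", i)]], df[[paste0("ref_gene_name.", i)]], sum, na.rm = TRUE)
}

sums <- sum_by_gene(df, i)
```

Returning a named vector instead of printing keeps the gene symbol attached to each sum, which makes it easier to collect the results into a new data frame.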
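Answer 2: (score: 0)
Assuming the data is structured like the random data below (seeded for reproducibility), consider summing within each group of columns first and then across the groups:
Data (where the gene names are statistical/numerical, closed/open-source programs and languages)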
gene_name <- c("SAS", "Stata", "SPSS", "Julia", "R", "Pandas")
set.seed(41918)
df <- data.frame(
TPM.0 = abs(rnorm(50))*100,
transcript_id.0 = replicate(50, paste(replicate(10, sample(LETTERS , 1, replace=TRUE)), collapse="")),
ref_gene_name.0 = replicate(50, sample(gene_name , 1, replace=TRUE)),
TPM.1 = abs(rnorm(50))*100,
transcript_id.1 = replicate(50, paste(replicate(10, sample(LETTERS , 1, replace=TRUE)), collapse="")),
ref_gene_name.1 = replicate(50, sample(gene_name , 1, replace=TRUE)),
TPM.2 = abs(rnorm(50))*100,
transcript_id.2 = replicate(50, paste(replicate(10, sample(LETTERS , 1, replace=TRUE)), collapse="")),
ref_gene_name.2 = replicate(50, sample(gene_name , 1, replace=TRUE))
)
head(df)
# TPM.0 transcript_id.0 ref_gene_name.0 TPM.1 transcript_id.1 ref_gene_name.1 TPM.2 transcript_id.2 ref_gene_name.2
# 1 86.142687 YVXKYYGWBA Stata 139.16500 IYIJLZITLR SPSS 42.39001 LFCAKYBJKI SPSS
# 2 133.150120 YZGWGGFKXG SPSS 19.46897 TULSBXMZPE SAS 88.39766 AUSWZRNRNZ Stata
# 3 139.804035 ZHPLNRNYWN Pandas 166.69469 WLUNYEPGAQ R 103.52094 CRERVAUSDU SPSS
# 4 146.847943 OTKELYDWDC SPSS 66.93809 LLOCPRBUZS R 62.43820 QZYZINREYO SAS
# 5 89.437472 NMAHZLRXJX SPSS 49.17413 VCEDDIBJHA Julia 148.03048 LTHJEDOPDB Julia
# 6 5.584601 WJLKHEBYYB Stata 88.22947 RERMEUCXGL SPSS 61.42689 HHGRPSVALV SAS
Processing
df$X <- NULL # BE SURE TO REMOVE ANYTHING BEFORE FIRST TPM
# LIST OF DATAFRAMES (EVERY 3 COLUMNS)
df_list <- lapply(seq(1, ncol(df), 3), function(i) {
tmp <- df[, c(i,(i+2))]
# NORMALIZE GENE INDICATOR COLUMN NAME
colnames(tmp)[2] <- "ref_gene_name"
# WITHIN SUM
aggregate(.~ref_gene_name, tmp, FUN=sum)
})
# CHAIN MERGE ACROSS ALL DATAFRAMES
wide_df <- Reduce(function(x, y) merge(x, y, by="ref_gene_name", all.x=TRUE), df_list)
# ACROSS SUM: ALL TPM COLUMNS
wide_df$TPM_All <- Reduce(`+`, wide_df[grep("TPM", names(wide_df))])
wide_df
# ref_gene_name TPM.0 TPM.1 TPM.2 TPM_All
# 1 Julia 1284.8478 649.3629 1250.2410 3184.452
# 2 Pandas 530.0559 590.9631 873.6411 1994.660
# 3 R 538.8770 509.3850 254.7034 1302.965
# 4 SAS 287.0210 645.4013 587.1971 1519.619
# 5 SPSS 659.0406 1008.8625 902.4517 2570.355
# 6 Stata 1095.2571 925.9412 781.9734 2803.172
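One thing to watch with the chain merge: all.x=TRUE keeps only the gene names found in the first data frame, so a gene that first appears in a later sample is dropped. The output above shows no missing values, so it is unaffected, but real data may differ. The sketch below, on two invented per-sample summaries (s0 and s1 are made-up names), compares all.x=TRUE with all=TRUE and sums across samples with rowSums(), treating an absent gene as contributing 0; that last choice is an assumption.

```r
# Two per-sample summaries, invented for illustration.
# Gene "B" is missing from the first sample, gene "C" from the second.
s0 <- data.frame(ref_gene_name = c("A", "C"), TPM.0 = c(1, 2))
s1 <- data.frame(ref_gene_name = c("A", "B"), TPM.1 = c(10, 20))

# Left join: keeps only genes present in the first data frame ("B" is lost).
left <- Reduce(function(x, y) merge(x, y, by = "ref_gene_name", all.x = TRUE),
               list(s0, s1))

# Full outer join: keeps every gene, with NA where a sample lacks it.
full <- Reduce(function(x, y) merge(x, y, by = "ref_gene_name", all = TRUE),
               list(s0, s1))

# Sum across the TPM columns; na.rm = TRUE counts an absent gene as 0.
tpm_cols <- grep("^TPM\\.", names(full))
full$TPM_All <- rowSums(full[tpm_cols], na.rm = TRUE)
```

Unlike Reduce(`+`, ...), which returns NA for any row containing a missing value, rowSums(..., na.rm = TRUE) still produces a total for genes that appear in only some samples.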