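I'm currently trying to run a PCA on some log2cpm data from an RNA expression experiment. I have done the following preprocessing:
Setting up the dataset for control and treatment: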
dataset <- read.table("log2cpm.txt", sep="\t", header = TRUE, row.names = NULL) %>% na.omit()#dataset
dataset <- dataset[!duplicated(dataset$hgnc_symbol), ]
row.names(dataset) <- dataset$hgnc_symbol
#Set gene database
gene_DB <- read.table("TableS1.txt", sep="\t", header = TRUE) #selection
gene_DB <- gene_DB[!duplicated(gene_DB$Symbol), ]
row.names(gene_DB) <- gene_DB$Symbol
Then I filtered the genes:
#Filter genes from dataset based on imported database
dataset_filtered <- dataset %>% filter(hgnc_symbol %in% gene_DB$Symbol)
Next I transposed the data frame and converted it to a matrix:
data_tsc <- t(as.matrix(dataset_filtered))
colnames(data_tsc) <- c(data_tsc[2,1:ncol(data_tsc)])
data_tsc <- data_tsc[c(-1,-2),]
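As you can see in the code, I always try to keep the row names (samples) and column names (genes), so that I can make sense of the PCA results later and keep track of the 300+ genes.
However, this approach fails when I run the matrix (data_tsc) through the PCA: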
#Run PCA####
pca <- prcomp(data_tsc[,c(1:ncol(data_tsc))], center = TRUE,scale. = TRUE)
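This returns:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
After some rigorous googling I found the problem: the as.matrix and t() calls made earlier convert the numeric values to chr.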
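A minimal sketch of that coercion, using a made-up three-gene data frame (the values and the names df, m_all and m_num are illustrative only). A matrix can hold a single type, so one character column is enough to turn every value into a string. Note that this only helps if the sample columns were read in as numbers in the first place.

```r
# Hypothetical toy data: one symbol column plus two numeric sample columns.
df <- data.frame(
  hgnc_symbol = c("ACVR1", "ADAM17", "AGER"),
  LIG_UT_1 = c(4.89, 6.58, 2.83),
  LIG_UT_2 = c(4.81, 6.48, 2.88)
)

# The character column forces the whole matrix to character.
m_all <- t(as.matrix(df))
is.character(m_all)

# Move the symbols into the row names and keep only the numeric columns;
# the transposed matrix then stays numeric and keeps both sets of names.
rownames(df) <- df$hgnc_symbol
m_num <- t(as.matrix(df[, sapply(df, is.numeric), drop = FALSE]))
is.numeric(m_num)
dimnames(m_num)
```

With this layout the samples end up as rows and the genes as columns, which is the orientation prcomp() expects.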
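I have tried repeatedly to fix this with functions such as apply, lapply, as.numeric and so on. I have searched through a lot of material, and every proposed solution either messed up my rows and columns or broke the whole dataset.
So, is there a quick and easy way to convert the chr values to numeric without losing the row and column names? :D
P.S. I'm only just learning to code, and I keep running into problems.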
Edit:
NelsonGon asked me to provide the following:
dput(head(data_tsc))
Which returned:
structure(c("4,891962697", "4,807689723", "5,07457417", "5,086369154",
"4,914961379", "4,83431453", "6,583923027", "6,482957338", "6,587420199",
"6,532262901", "6,438933039", "6,448834899", "2,832721409", "2,881398092",
"2,389231753", "2,780670224", "2,417835957", "2,761576388", "7,494008371",
"7,58143903", "7,62969704", "7,579694323", "7,438227488", "7,513190279",
"6,257073157", "6,351044394", "6,313216639", "6,597298125", "6,112566161",
"6,315617767", "6,822914122", "6,660904066", "6,925653718", "7,379973187",
"6,804033651", "6,443382931", "5,271577287", "5,510134745", "5,418971124",
"5,551120518", "5,302474278", "5,552416478", "5,165993558", "5,030291607",
"5,145076323", "4,905049925", "5,202651513", "5,250135996", "2,827019018",
"2,626020468", "2,702723667", "2,575260635", "2,30347029", "2,449794083",
"5,866824758", "5,881522359", "5,913145862", "5,922174742", "5,869024665",
"5,896680873"), .Dim = c(6L, 10L), .Dimnames = list(c("LIG_UT_1",
"LIG_UT_2", "LIG_UT_3", "LIG_UT_4", "LIG_UT_5", "LIG_UT_6"),
c("ACVR1", "ADAM17", "AGER", "AKT1", "ANPEP", "ANXA1", "AR",
"ATM", "AURKA", "AXIN1")))
Edit after the second suggestion: I changed the read.table() call to specify dec = ",":
dataset <- read.table("log2cpm.txt", sep="\t", header = TRUE, row.names = NULL, dec = ",")
This gave the following dput output:
structure(c(" 4.8919627", " 4.8076897", " 5.0745742", " 5.0863692",
" 4.9149614", " 4.8343145", " 6.5839230", " 6.4829573", " 6.5874202",
" 6.5322629", " 6.4389330", " 6.4488349", " 2.8327214", " 2.8813981",
" 2.3892318", " 2.7806702", " 2.4178360", " 2.7615764", " 7.4940084",
" 7.5814390", " 7.6296970", " 7.5796943", " 7.4382275", " 7.5131903",
" 6.2570732", " 6.3510444", " 6.3132166", " 6.5972981", " 6.1125662",
" 6.3156178", " 6.8229141", " 6.6609041", " 6.9256537", " 7.3799732",
" 6.8040337", " 6.4433829", " 5.2715773", " 5.5101347", " 5.4189711",
" 5.5511205", " 5.3024743", " 5.5524165", " 5.1659936", " 5.0302916",
" 5.1450763", " 4.9050499", " 5.2026515", " 5.2501360", " 2.8270190",
" 2.6260205", " 2.7027237", " 2.5752606", " 2.3034703", " 2.4497941",
" 5.8668248", " 5.8815224", " 5.9131459", " 5.9221747", " 5.8690247",
" 5.8966809"), .Dim = c(6L, 10L), .Dimnames = list(c("LIG_UT_1",
"LIG_UT_2", "LIG_UT_3", "LIG_UT_4", "LIG_UT_5", "LIG_UT_6"),
c("ACVR1", "ADAM17", "AGER", "AKT1", "ANPEP", "ANXA1", "AR",
"ATM", "AURKA", "AXIN1")))
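For reference, a small sketch of what dec changes, using made-up inline file contents instead of log2cpm.txt (the names raw, d_default and d_comma are illustrative). It also suggests why the values above are still strings: the columns are numeric after reading, but as.matrix() on a data frame that still contains the symbol column turns everything back into text.

```r
# Hypothetical two-gene, two-sample file contents with decimal commas.
raw <- "hgnc_symbol\tLIG_UT_1\tLIG_UT_2\nACVR1\t4,89\t4,81\nADAM17\t6,58\t6,48"

# Default dec = ".": "4,89" is not a valid number, so the columns are not numeric.
d_default <- read.table(text = raw, sep = "\t", header = TRUE)
sapply(d_default, class)

# dec = ",": the same columns are parsed as numeric.
d_comma <- read.table(text = raw, sep = "\t", header = TRUE, dec = ",")
sapply(d_comma, class)

# The symbol column is still non-numeric, so converting the whole
# data frame to a matrix still gives a character matrix.
is.character(as.matrix(d_comma))
```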
Solution
Based on Adam's earlier suggestion, I added dec = "," to read.table and then used the following code:
dataset_numeric <- apply(data_tsc, 2, as.numeric)
rownames(dataset_numeric) <- rownames(data_tsc)
colMeans(dataset_numeric)
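I managed to convert the character values to numeric while still keeping the row and column names. The PCA worked, and:
is.numeric(dataset_numeric)
[1] TRUE
Thanks for the help, everyone. I was pulling my hair out in frustration.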
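Answer 0 (score: 1)
The problem is probably that the decimal mark is a comma rather than a period. Try converting that first (here dataset is the character matrix from your dput output):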
dataset_numeric <- sub(",",".",dataset)
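Once that is done, the rest should be straightforward. From this point on it is probably a duplicate of the following question, with the added requirement of keeping the row names:
Convert character matrix into numeric matrix
So in this case you can adapt it slightly: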
dataset_numeric <- apply(dataset_numeric, 2, as.numeric)
rownames(dataset_numeric) <- rownames(dataset)
Or go with this option:
class(dataset_numeric) <- "numeric"
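A self-contained sketch of both options on a cut-down version of the dput output (two samples, two genes). The names chr, chr_dot, num_apply and num_class are illustrative, and fixed = TRUE is an optional addition that treats the comma as a literal string.

```r
# Two samples and two genes taken from the dput() output in the question.
chr <- matrix(c("4,891962697", "4,807689723", "6,583923027", "6,482957338"),
              nrow = 2,
              dimnames = list(c("LIG_UT_1", "LIG_UT_2"), c("ACVR1", "ADAM17")))

# sub() keeps the attributes of its input, so dim and dimnames survive.
chr_dot <- sub(",", ".", chr, fixed = TRUE)

# Option 1: apply() over columns loses the row names, so restore them.
num_apply <- apply(chr_dot, 2, as.numeric)
rownames(num_apply) <- rownames(chr_dot)

# Option 2: change the class in place; dim and dimnames are left untouched.
num_class <- chr_dot
class(num_class) <- "numeric"

# Both routes should give the same numeric matrix.
all.equal(num_apply, num_class)
```

Option 2 is shorter, while option 1 makes the row-name handling explicit.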
To test:
prcomp(dataset_numeric, center = TRUE, scale = TRUE)
This runs without error:
Standard deviations (1, .., p=6):
[1] 2.191373e+00 1.464462e+00 1.331818e+00 1.002092e+00 5.246949e-01 3.755055e-15
Rotation (n x k) = (10 x 6):
               PC1         PC2         PC3         PC4         PC5         PC6
ACVR1  -0.33509491 -0.32378624  0.35207791  0.04650037 -0.22465986 -0.07403592
ADAM17 -0.26169241 -0.47259488 -0.30394898 -0.13763357 -0.18328981  0.41562880
AGER   -0.07354562  0.38073508 -0.56645061  0.26681868 -0.28597500  0.12602119
AKT1   -0.37111066  0.01674254 -0.07923664 -0.48941844  0.56009962  0.31877982
ANPEP  -0.41234145  0.25398752 -0.06276181  0.12397346 -0.28744359  0.12200886
ANXA1  -0.34908735 -0.20718967  0.18610579  0.51004989 -0.01539492  0.28629143
AR     -0.23808868  0.54584757  0.08481153 -0.27218135 -0.07711181  0.16714943
ATM     0.37104240 -0.14079095  0.04995052 -0.44945864 -0.56884559  0.33723134
AURKA  -0.20262305 -0.29758992 -0.57407802 -0.16727601 -0.03025329 -0.51762461
AXIN1  -0.38573848  0.11317416  0.28050560 -0.29761514 -0.32731009 -0.44477115