Here is a small example:
X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3)
Input data frame:
X1 X2 X3
1 AC AC AC
2 AC AC AC
3 AC AC AC
4 CA CA AC
5 TA AT AA
6 AT CA AT
7 CC AC CC
8 CC TC CA
Function:
# Function
atgc <- function(x) {
xlate <- c( "AA" = 11, "AC" = 12, "AG" = 13, "AT" = 14,
"CA"= 12, "CC" = 22, "CG"= 23,"CT"= 24,
"GA" = 13, "GC" = 23, "GG"= 33,"GT"= 34,
"TA"= 14, "TC" = 24, "TG"= 34,"TT"=44,
"ID"= 56, "DI"= 56, "DD"= 55, "II"= 66
)
x = xlate[x]
}
outdataframe <- sapply (mydf1, atgc)
outdataframe
X1 X2 X3
AA 11 11 12
AA 11 11 12
AA 11 11 12
AG 13 13 12
CA 12 12 11
AC 12 13 13
AT 14 11 12
AT 14 14 14
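The problem: AC should map to 12, but the output shows 11, and the other values are wrong in the same way. It is a mess!
(Extra: I also don't know how to get rid of the row names.)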
Answer 0 (score: 4)
Just use apply and transpose:
t(apply (mydf1, 1, atgc))
To use sapply, either create the data frame with stringsAsFactors=FALSE, i.e.
mydf1 <- data.frame(X1, X2, X3, stringsAsFactors=FALSE)
(thanks @joran), or change the last line of the function to:
x = xlate[as.vector(x)]
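Both fixes work for the same reason: subsetting a vector with a factor uses the factor's integer codes, not its labels, so xlate[x] picks elements by position. The following is a minimal sketch with a shortened lookup table; the names xlate_small, f, by_code and by_label are illustrative and not part of the original code. Note also that since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so the symptom should only appear on older versions or when factors are created explicitly.

```r
# Shortened lookup table, for illustration only (not the full table above)
xlate_small <- c("AA" = 11, "AC" = 12, "AG" = 13)

# Factor levels are sorted, so "AC" gets code 1 and "AG" gets code 2
f <- factor(c("AC", "AG", "AC"))

# Indexing with the factor itself uses the integer codes 1, 2, 1
by_code <- unname(xlate_small[f])

# Indexing with the labels looks the values up by name
by_label <- unname(xlate_small[as.character(f)])

print(by_code)   # 11 12 11 (wrong: positions 1, 2, 1)
print(by_label)  # 12 13 12 (intended lookup by name)
```

This mirrors the question's output, where AC came back as 11 because it happened to be the first factor level.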
Answer 1 (score: 1)
The match function accepts a factor as its first argument together with a target vector of class "character" to match against:
atgc <- function(fac){ c(11, 12, 13, 14,
12, 22, 23, 24,
13, 23, 33, 34,
14, 24, 34,44,
56, 56, 55, 66 )[
match(fac,
c("AA", "AC", "AG", "AT",
"CA", "CC", "CG","CT",
"GA", "GC", "GG","GT" ,
"TA", "TC", "TG","TT",
"ID", "DI", "DD", "II") )
]}
#The match function returns an index that is designed to pull from a vector.
sapply(mydf1, atgc)
X1 X2 X3
[1,] 12 12 12
[2,] 12 12 12
[3,] 12 12 12
[4,] 12 12 12
[5,] 14 14 11
[6,] 14 12 14
[7,] 22 12 22
[8,] 22 24 12
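To illustrate why no explicit conversion is needed in this approach, here is a small sketch (the names keys, vals, fac, idx and res are hypothetical). match() converts a factor to character before comparing, so it returns the positions of the labels in the target vector rather than the factor's internal codes.

```r
# Hypothetical, shortened key/value vectors for illustration
keys <- c("AA", "AC", "CC")
vals <- c(11, 12, 22)

# Levels are "AC", "CC", so the internal codes are 2, 1
fac <- factor(c("CC", "AC"))

# match() compares labels, giving positions 3 and 2 in keys
idx <- match(fac, keys)

# Use the positions to pull the corresponding values
res <- vals[idx]

print(as.integer(fac))  # 2 1 (internal codes, not used by match)
print(idx)              # 3 2
print(res)              # 22 12
```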
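Answer 2 (score: 0)
This way, you only need to supply a replacement value for each letter in the matrix, without having to check carefully that you have covered every combination and matched each one correctly, although your example has a limited set of combinations.
Define a list of the values and their replacements: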
trans <- list(c("A","1"),c("C","2"),c("G","3"),c("T","4"),
c("I","6"),c("D","5"))
Do the substitution with gsub():
atgc2 <- function(myData, x) gsub(x[1], x[2], myData)
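Create a matrix with the replaced values (in this case mydf1 is converted to a matrix so that gsub() returns a matrix of character values, but you should check that this works with any other data before proceeding):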
mymat <- Reduce(atgc2, trans, init = as.matrix(mydf1))
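The values in mymat are still in the order in which they originally appeared, so "AC" = "12" and "CA" = "21". Reorder the digits within each value (and convert the result to numeric):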
ansVec <- sapply( strsplit( mymat, split = ""),
function(x) as.numeric( paste0( sort( as.numeric(x) ), collapse = "")))
The object ansVec is a vector, so convert it back to a data.frame:
( mydf2 <- data.frame( matrix( ansVec, nrow = nrow(mydf1) ) ) )
# X1 X2 X3
# 1 12 12 12
# 2 12 12 12
# 3 12 12 12
# 4 12 12 12
# 5 14 14 11
# 6 14 12 14
# 7 22 12 22
# 8 22 24 12
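For this case, the other answers are certainly faster. However, as the replacement operations become more complex, I think this solution may offer some benefits. One thing this method does not handle is checking a string such as "ATTGCG" for both "ATT" and "TTG".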
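One reading of that limitation, sketched under the assumption that the concern is overlapping patterns: sequential gsub() calls consume the text they replace, so a second pattern that overlaps the first can no longer be found. The variable names below are illustrative only.

```r
s <- "ATTGCG"

# Both patterns occur in the original string (they overlap at "TT")
has_att <- grepl("ATT", s, fixed = TRUE)
has_ttg <- grepl("TTG", s, fixed = TRUE)

# After replacing "ATT", the overlapping "TTG" is no longer there to find
after_att <- gsub("ATT", "x", s, fixed = TRUE)
still_has_ttg <- grepl("TTG", after_att, fixed = TRUE)

print(c(has_att, has_ttg))  # TRUE TRUE
print(after_att)            # "xGCG"
print(still_has_ttg)        # FALSE
```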
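Answer 3 (score: 0)
Actually, I think you want to represent the original vectors as factors, because they represent a finite set of levels (DNA dinucleotides) rather than arbitrary character values.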
lvls = c("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT", "GA", "GC",
"GG", "GT", "TA", "TC", "TG", "TT", "ID", "DI", "DD", "II")
X1 <- factor(c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC"), levels=lvls)
X2 <- factor(c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC"), levels=lvls)
X3 <- factor(c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA"), levels=lvls)
mydf1 <- data.frame(X1, X2, X3)
Likewise, "11" is a level of a factor, not the number 11. So the mapping between levels is
xlate <- c("AA" = "11", "AC" = "12", "AG" = "13", "AT" = "14",
"CA"= "12", "CC" = "22", "CG"= "23","CT"= "24",
"GA" = "13", "GC" = "23", "GG"= "33","GT"= "34",
"TA"= "14", "TC" = "24", "TG"= "34","TT"="44",
"ID"= "56", "DI"= "56", "DD"= "55", "II"= "66")
and to 're-level' a single variable:
levels(X1) <- xlate
To re-level all columns of the data frame:
as.data.frame(lapply(mydf1, `levels<-`, xlate))
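Using sapply is not appropriate, because it creates a (character) matrix even though you have named the result outdataframe. This distinction could actually matter for the SNP data this presumably represents: millions of SNPs across 1000 samples stored as a matrix would be implemented as a single vector longer than the longest vector R can store (modulo the large vector support being introduced in R-devel), whereas a data frame would be a list of vectors with only millions of elements each.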
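A rough sketch of both points, using made-up sizes (3 million SNPs, 1000 samples) and a tiny hypothetical data frame; none of these names or numbers come from the original post. It compares the element count of a single matrix against the classic vector length limit, and shows that sapply() simplifies to a matrix while lapply() plus as.data.frame() keeps separate columns.

```r
# Hypothetical sizes, for illustration only
n_snps <- 3e6
n_samples <- 1000

# A matrix is stored as one vector with n_snps * n_samples elements
matrix_len <- n_snps * n_samples

# Vectors without long-vector support are limited to 2^31 - 1 elements
max_len <- .Machine$integer.max
exceeds <- matrix_len > max_len

# Tiny example data frame with factor columns
df <- data.frame(a = factor(c("AC", "CA")), b = factor(c("CA", "AC")))

# sapply() simplifies the result to a matrix
m <- sapply(df, as.character)

# lapply() returns a list of columns, which converts back to a data frame
d <- as.data.frame(lapply(df, as.character))

print(exceeds)           # TRUE for these sizes
print(is.matrix(m))      # TRUE
print(is.data.frame(d))  # TRUE
```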