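This problem is hard to explain, but let me tell you what I want to get from this data. I have a dataset with 20 different columns, two of which are shown here.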
Sequence modifications
AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)
AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)
AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE [7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)
AAIYKLLKSHFRNE [5] Biotin (K)|[8] Acetyl (K)
AAKKFEE [3] Acetyl (K)|[4] Acetyl (K)
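As you can see, the same sequence can carry different modifications. Sometimes there are three acetyl groups, sometimes two, sometimes only one, and in other cases there is no modification at all. I am interested in only two modifications, biotin and acetyl; the others do not matter. The number of modifications depends on the number of "K"s in the sequence. For example, if a sequence contains three "K"s, the possible number of modifications is 0, 1, 2 or 3, and never more than 3. So I would like to group these sequences (1000 rows) by the number of "K"s in the sequence and by the number and type of modifications it carries, without disturbing the other columns.
What I want to get from this data with R is separate groups of sequences with the specified modifications. For example: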
First Group: (number of "K" in the sequence = 2, and both modified by acetyl)
Sequence modifications
AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)
AAIYKLLKSHFRNE [5] Acetyl (K)|[8] Acetyl (K)
Second Group: (number of "K" in the sequence = 2, and one modified by acetyl, second nothing)
Third Group: (number of "K" in the sequence = 3, and one modified by acetyl, second acetyl, and last is biotin)
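I need to cover all the possibilities. This is what I think would work best for this "part" of the script I am trying to write. Perhaps you have other suggestions for how to handle this data.
The second question is: I calculated the mean of the values in 3 different columns, and I would like to put the result in the same data but in another column. How can I do that?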
tbl_imp$mean <- rowMeans(subset(tbl_imp, select = c("x", "y", "w")), na.rm = TRUE)
tbl_imp$mean <- data.frame(tbl_imp$mean)
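This is the code I used to calculate the row means. I just don't know how to create a new column in my data and put the results there. Should I use the transform function?
Answer 0 (score: 0)
Something like this might work for the first part of your question. I can't download the file right now, but when I can, I will try to respond to the second part.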
library(data.table)
library(stringr)
# Slightly modified dataset
dataset <- data.table(
Sequence = c(
'AAAAGAAAVANQGKK'
,'AAAAGAAAVANQGKK'
,'AAIKFIKFINPKINDGE'
,'AAIKFIKFINPKINDGE'
,'AAIKFIKFINPKINDGE'
,'AAIKFIKFINPKINDGE'
,'AAIYKLLKSHFRNE'
,'AAKKFEE'
),
modifications = c(
'[14] Acetyl (K)|[15] Acetyl (K)'
,'[14] Acetyl (K)|[15] Acetyl (K)'
,'[4] Acetyl (K)|[7] Something (K)|[12] Acetyl (K)'
,'[4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)'
,'[7] Acetyl (K)|[12] Acetyl (K)'
,'[4] Acetyl (K)|[7] Acetyl (K)'
,'[5] Biotin (K)|[8] Acetyl (K)'
,'[3] Acetyl (K)'
)
)
# get the 1st, 2nd, 3rd modifications in separate columns
dataset <- data.table(cbind(
dataset,
str_split_fixed(dataset[,modifications], pattern = "\\(K\\)",3)
))
dataset[,':='(
V1 = as.character(V1),
V2 = as.character(V2),
V3 = as.character(V3)
)]
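# Count of modifications (note: despite its name, NoOfKs counts the modifications listed, not the "K" residues in the sequence)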
dataset[, NoOfKs := 3]
dataset[V3 == "", NoOfKs := 2]
dataset[V2 == "", NoOfKs := 1]
dataset[V1 == "", NoOfKs := 0]
# Retaining Acetyl/Biotin or no modification only
dataset[, AB01 := TRUE]
dataset[, AB02 := TRUE]
dataset[, AB03 := TRUE]
dataset[V1 != "", AB01 := grepl(V1, pattern = "Acetyl|Biotin")]
dataset[V2 != "", AB02 := grepl(V2, pattern = "Acetyl|Biotin")]
dataset[V3 != "", AB03 := grepl(V3, pattern = "Acetyl|Biotin")]
dataset <- dataset[AB01 & AB02 & AB03]
# Marking each modification as acetyl/biotin/none
dataset[V1 != " " & grepl(V1, pattern = "Acetyl"), AB1 := "Acetyl"]
dataset[V1 != " " & grepl(V1, pattern = "Biotin"), AB1 := "Biotin"]
dataset[V2 != " " & grepl(V2, pattern = "Acetyl"), AB2 := "Acetyl"]
dataset[V2 != " " & grepl(V2, pattern = "Biotin"), AB2 := "Biotin"]
dataset[V3 != " " & grepl(V3, pattern = "Acetyl"), AB3 := "Acetyl"]
dataset[V3 != " " & grepl(V3, pattern = "Biotin"), AB3 := "Biotin"]
dataset[
,
list(
Sequence = Sequence,
modifications = modifications,
GroupID = .GRP
),
by = c('NoOfKs','AB1','AB2','AB3')
]
Output
NoOfKs AB1 AB2 AB3 Sequence modifications GroupID
1: 2 Acetyl Acetyl NA AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 1
2: 2 Acetyl Acetyl NA AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 1
3: 2 Acetyl Acetyl NA AAIKFIKFINPKINDGE [7] Acetyl (K)|[12] Acetyl (K) 1
4: 2 Acetyl Acetyl NA AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K) 1
5: 3 Acetyl Acetyl Acetyl AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K) 2
6: 2 Biotin Acetyl NA AAIYKLLKSHFRNE [5] Biotin (K)|[8] Acetyl (K) 3
7: 1 Acetyl NA NA AAKKFEE [3] Acetyl (K) 4
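For the second part of the question (storing the row means in a new column), the rowMeans() call shown in the question already creates the column. The follow-up data.frame() assignment is not needed and would nest a data frame inside that column. Here is a minimal sketch, assuming a toy tbl_imp with the column names x, y and w from the question; the values are made up for illustration.

```r
# Toy stand-in for tbl_imp; the values are invented for illustration only
tbl_imp <- data.frame(
  x = c(1, 2, NA),
  y = c(3, 4, 5),
  w = c(5, 6, 7)
)

# Assigning to tbl_imp$mean creates the column if it does not exist yet
tbl_imp$mean <- rowMeans(tbl_imp[, c("x", "y", "w")], na.rm = TRUE)

# Equivalent with transform(), which returns a modified copy of the data frame
tbl_imp2 <- transform(tbl_imp, mean2 = rowMeans(cbind(x, y, w), na.rm = TRUE))

print(tbl_imp2)
```

Both forms leave the other columns untouched, so transform() is optional rather than required.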
Answer 1 (score: 0)
I loaded your data as the object aa.
mydata <- data.frame(seqs = aa$Sequence, mods = aa$modifications) # subset of aa with sequences and modifications
##to find number of "K"s
spl_seqs <- strsplit(as.character(mydata$seqs), split = "") # split all sequences (use "as.character" because they are turned into factor)
where_K <- lapply(spl_seqs, grep, pattern = "K") # find positions of "K"s in each sequence
No_K <- sapply(where_K, length) # count "K"s in each sequence (sapply gives a plain integer vector rather than a list)
mydata$No_Ks <- No_K #add a column that informs about the number of "K"s in each sequence
##
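As an alternative to splitting every sequence into single characters, the "K"s can be counted directly with base R regular-expression helpers. This is only a sketch; the helper name count_K is hypothetical, and the sequences are copied from the question.

```r
# Example sequences taken from the question
seqs <- c("AAAAGAAAVANQGKK", "AAIKFIKFINPKINDGE", "AAIYKLLKSHFRNE", "AAKKFEE")

# gregexpr() locates every "K", regmatches() extracts the matches,
# and lengths() counts them (0 when a sequence contains no "K")
count_K <- function(x) {
  x <- as.character(x)  # guards against factor input
  lengths(regmatches(x, gregexpr("K", x, fixed = TRUE)))
}

No_Ks <- count_K(seqs)
print(No_Ks)
```

The result is a plain integer vector, so it can be assigned to a data frame column directly.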
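I assumed that every upper-case letter in the "modifications" column refers either to the modification itself or to "K". I couldn't think of another way to simplify the "modifications" column so that it can be manipulated, so I just keep the upper-case letters that are not "K":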
names(LETTERS) <- LETTERS # DWin's idea in this http://stackoverflow.com/questions/4423460/is-there-a-function-to-find-all-lower-case-letters-in-a-character-vector
spl_mods <- strsplit(as.character(mydata$mods), split = "") # split the characters in each modification row
Simplify the modifications column, keeping only the first letter of each modification:
mods_ls <- vector("list", length = nrow(mydata)) #list to fill with simplified modifications
for(i in seq_along(spl_mods))
{
res <- as.character(na.omit(LETTERS[spl_mods[[i]]])) #keep only upper-case letters (reuses spl_mods instead of re-splitting the whole column on every iteration)
res <- as.character(na.omit(gsub("K", NA, res))) # exclude "K"s
res <- as.character(na.omit(gsub("M", NA, res))) # and "M"s I guessed
mods_ls[[i]] <- res
}
mydata$simplified_mods <- unlist(lapply(mods_ls, paste, collapse = " ; "))
What we have so far:
mydata[1:10,]
# seqs mods No_Ks simplified_mods
#1 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 2 A ; A
#2 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 2 A ; A
#3 AAFTKLDQVWGSE [5] Acetyl (K) 1 A
#4 AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K) 3 A ; A ; A
#5 AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K) 3 A ; A ; A
#6 AAIKFIKFINPKINDGE [7] Acetyl (K)|[12] Acetyl (K) 3 A ; A
#7 AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K) 3 A ; A
#8 AAIYKLLKSHFRNE [5] Biotin (K)|[8] Acetyl (K) 2 B ; A
#9 AAKKFEE [3] Acetyl (K)|[4] Acetyl (K) 2 A ; A
#10 AAKYFRE [3] Acetyl (K) 1 A
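The single-letter codes above rely on each modification name containing exactly one capital letter other than "K" and "M". If the full names are preferred, a regular expression can pull them out instead. The sketch below assumes every entry has the form "[position] Name (K)" as in the question; the helper name extract_mods is hypothetical.

```r
# Example modification strings taken from the question
mods <- c(
  "[14] Acetyl (K)|[15] Acetyl (K)",
  "[5] Biotin (K)|[8] Acetyl (K)",
  "[3] Acetyl (K)"
)

# Capture the word that directly precedes " (K)" in each entry;
# the lookahead (?= ...) requires perl = TRUE
extract_mods <- function(x) {
  x <- as.character(x)
  regmatches(x, gregexpr("[A-Za-z]+(?= \\(K\\))", x, perl = TRUE))
}

mod_names <- extract_mods(mods)

# Collapse each row's names into one string, mirroring simplified_mods
full_mods <- vapply(mod_names, paste, character(1), collapse = " ; ")
print(full_mods)
```

The length of each element of mod_names also gives the number of modified "K"s in that row.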
You can then subset on the number of "K"s and the specific modifications you want. E.g.:
how_many_K <- 2
what_mods <- "A ; A" #separated by [space];[space]
show_rows <- which(mydata$No_Ks == how_many_K & mydata$simplified_mods == what_mods)
mydata[show_rows,]
# seqs mods No_Ks simplified_mods
#1 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 2 A ; A
#2 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 2 A ; A
#9 AAKKFEE [3] Acetyl (K)|[4] Acetyl (K) 2 A ; A
#11 AANVKKTLVE [5] Acetyl (K)|[6] Acetyl (K) 2 A ; A
#14 AARDSKSPIILQTSNGGAAYFAGKGISNE [6] Acetyl (K)|[24] Acetyl (K) 2 A ; A
#20 AEKLKAE [3] Acetyl (K)|[5] Acetyl (K) 2 A ; A
#21
#....
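Since the question asks for every possible group rather than one combination at a time, the same two columns can also drive split(), which returns all groups in one step. This is a sketch on a toy data frame shaped like mydata (rows copied from the printed output above); the group column and its label format are assumptions made for illustration.

```r
# Toy data in the shape of mydata above (values copied from the printed rows)
toy <- data.frame(
  seqs = c("AAAAGAAAVANQGKK", "AAKKFEE", "AAIYKLLKSHFRNE", "AAKYFRE"),
  No_Ks = c(2L, 2L, 2L, 1L),
  simplified_mods = c("A ; A", "A ; A", "B ; A", "A"),
  stringsAsFactors = FALSE
)

# One group label per row, e.g. "2 K: A ; A"
toy$group <- paste0(toy$No_Ks, " K: ", toy$simplified_mods)

# split() returns a named list with one data frame per group;
# every other column of the row is carried along unchanged
groups <- split(toy, toy$group)
print(names(groups))
```

Each element of groups can then be inspected or written out separately, for example groups[["2 K: A ; A"]].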
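EDIT: All of this can be wrapped in a function like fun. x is the data.frame (the structure uploaded "for Henrik"). noK is the number of "K"s you want. no_modK is the number of modified "K"s. mod is the modifications you want, separated by [space];[space] (e.g. "B ; A ; O"):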
fun <- function(x, noK, no_modK = NULL, mod = NULL) #EDIT_1e: update arguments; made optional
{
mydata <- data.frame(seqs = x$Sequence, mods = x$modifications)
spl_seqs <- strsplit(as.character(mydata$seqs), split = "")
where_K <- lapply(spl_seqs, grep, pattern = "K")
No_K <- sapply(where_K, length) # plain integer vector rather than a list
mydata$No_Ks <- No_K
names(LETTERS) <- LETTERS
spl_mods <- strsplit(as.character(mydata$mods), split = "")
mods_ls <- vector("list", length = nrow(mydata))
for(i in seq_along(spl_mods))
{
res <- as.character(na.omit(LETTERS[spl_mods[[i]]])) # reuse spl_mods instead of re-splitting the whole column on every iteration
no_modedK <- length(grep("K", res)) #EDIT_1a: how many "K"s are modified?
res <- as.character(na.omit(gsub("K", NA, res)))
res <- as.character(na.omit(gsub("M", NA, res)))
mods_ls[[i]] <- list(mods = res, modified_K = no_modedK) #EDIT_1b: catch number of "K"s modified (along with the actual modifications)
}
mydata$no_modK <- unlist(lapply(lapply(lapply(mods_ls, `[`, 2), unlist), paste, collapse = " ; ")) #EDIT_1d: insert number of modified "K"s in "mydata"
mydata$simplified_mods <- unlist(lapply(lapply(lapply(mods_ls, `[`, 1), unlist), paste, collapse = " ; ")) #EDIT_1c: insert mods in "mydata"
if(!is.null(no_modK) && !is.null(mod)) #EDIT_1f: update "return"
{
show_rows <- which(mydata$No_Ks == noK & mydata$no_modK == no_modK & mydata$simplified_mods == mod)
}
if(is.null(no_modK) && !is.null(mod))
{
show_rows <- which(mydata$No_Ks == noK & mydata$simplified_mods == mod)
}
if(is.null(mod) && !is.null(no_modK))
{
show_rows <- which(mydata$No_Ks == noK & mydata$no_modK == no_modK)
}
if(is.null(no_modK) && is.null(mod))
{
show_rows <- which(mydata$No_Ks == noK)
}
return(mydata[show_rows,])
}
E.g:
fun(aa, noK = 3) # aa is the "for Henrik" structure loaded in `R` (aa <- structure( ... ))
seqs mods No_Ks no_modK simplified_mods
4 AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K) 3 3 A ; A ; A
5 AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K) 3 3 A ; A ; A
6 AAIKFIKFINPKINDGE [7] Acetyl (K)|[12] Acetyl (K) 3 2 A ; A
#...
fun(aa, noK = 3, no_modK = 2)
seqs mods No_Ks no_modK simplified_mods
6 AAIKFIKFINPKINDGE [7] Acetyl (K)|[12] Acetyl (K) 3 2 A ; A
7 AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K) 3 2 A ; A
#...
fun(aa, noK = 2, mod = "A ; B")
seqs mods No_Ks no_modK simplified_mods
200 ISAMVLTKMKE [8] Acetyl (K)|[10] Biotin (K) 2 2 A ; B
441 NLKPSKPSYYLDPE [3] Acetyl (K)|[6] Biotin (K) 2 2 A ; B
#...
fun(aa, noK = 2, no_modK = 1, mod = "A")
seqs mods No_Ks no_modK simplified_mods
15 AARDSKSPIILQTSNGGAAYFAGKGISNE [24] Acetyl (K) 2 1 A
27 AKALVAQGVKFIAE [2] Acetyl (K) 2 1 A
#...
EDIT_1: Updated fun and the examples.