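In one field of a table, I have a string like the following for several hundred samples:
Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg
The descriptive text in the opening sentences is identical for every sample, so I would like to remove it. From the last sentence, I would like to extract the value associated with each classification and put each one into a separate field of a matrix, where M1 would be [1,1], t(8;21) would be [2,1], pos would be [3,1], and so on. I am not sure where to start, especially because some values are missing (e.g. FAB=, Karyotype=, etc.), and because the value in the Karyotype field is sometimes enclosed in quotation marks, as in the example above, and at other times is numeric and may contain special characters, e.g. -7.
Any suggestions would be much appreciated.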
Answer 0 (score: 1)
Here is a very straightforward approach, demonstrated on two copies of the input string.
input = 'Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg'
input = rep(input, 2)
#remove everything up through "purification. "
result = sub(pattern = ".*purification\\. ", replacement = "", x = input)
# split by commas:
result = strsplit(result, split = ", ")
# delete everything through "="
result = lapply(result, sub, pattern = ".*=", replacement = "")
do.call(rbind, result)
#      [,1] [,2]          [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
# [1,] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"
# [2,] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"
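If you want to swap rows and columns, use cbind instead of rbind in the last step. This should work whether or not the Karyotype value is quoted. It also handles missing values by filling in an empty string, as long as the key is still present when the value is missing (e.g. "FAB=, Karyotype..."). You may want to replace the empty strings with NA as an additional step.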
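The optional clean-up mentioned above could look roughly like the sketch below. The `records` vector is made-up illustration data (one record with an empty FAB value, one with a quoted Karyotype), and stripping the quotation marks is an extra step that the answer itself does not perform.

```r
# Hypothetical records: the first has an empty FAB value,
# the second has a quoted Karyotype value
records <- c('FAB=, Karyotype=-7, FLT3 ITD=pos',
             'FAB=M1, Karyotype="t(8;21)", FLT3 ITD=neg')

# Same steps as above: split on ", " and drop everything through "="
parts  <- strsplit(records, split = ", ")
values <- lapply(parts, sub, pattern = ".*=", replacement = "")
mat    <- do.call(rbind, values)

# Remove literal double quotes; mat[] keeps the matrix dimensions
mat[] <- gsub('"', "", mat, fixed = TRUE)

# Turn empty strings into missing values
mat[mat == ""] <- NA
mat
```

Assigning NA into a character matrix stores it as a character NA, so the matrix keeps its type.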
Answer 1 (score: 0)
Another approach is to use regular expressions, for example:
string <- 'Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg'
1. First, split the string on , with strsplit:
string <- strsplit(string, ',')[[1]]
## You will have a vector like this:
## [1] "Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1"
## [2] " Karyotype=\"t(8;21)\""
## [3] " FLT3 ITD=pos"
## [4] " FLT3 TKD=pos"
## [5] " N-RAS=neg"
## [6] " K-RAS=neg"
## [7] " EVI1=neg"
## [8] " cEBPa=neg"
2. Use gsub with a regular expression to remove all characters up to and including the =:
gsub(".*=","",string)
## The result is a vector with "clean" data.
## [1] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"
The pattern .*= matches every character up to and including the =, and gsub replaces that match with nothing ("").
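As a small sketch of how that pattern behaves, the example below uses a shortened, made-up `pieces` vector. It also shows the mirror-image pattern `=.*`, which is not part of the answer above but could be used to recover the field names and label the values.

```r
# Hypothetical pieces, similar to what splitting on "," produces
pieces <- c("some leading prose. FAB=M1",
            " Karyotype=\"t(8;21)\"",
            " FLT3 ITD=pos")

# ".*=" is greedy: it consumes everything up to and including the last "="
values <- sub(".*=", "", pieces)

# "=.*" removes the "=" and everything after it, leaving the key
keys <- trimws(sub("=.*", "", pieces))

# Drop any leading prose that ends in ". " (only the first piece has it)
keys <- sub(".*\\. ", "", keys)

# Label each value with its key
result <- setNames(values, keys)
result
```

Because each piece contains only one =, sub and gsub give the same result here. gsub only matters when a pattern can match more than once in the same element.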
For more about regular expressions in R, see this link, which helped me a lot.
Hope this helps.
Regards