How to extract multiple values from the same string and put them into a matrix in R

Time: 2018-08-20 18:34:59

Tags: r string split

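One field of my table holds a string like the following, for several hundred samples:

Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100% blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=, K-RAS=, EVI1=, cEBPa=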

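The italicized text in the opening sentences is identical for every sample, so I want to remove it. From the last sentence, I want to extract each of the bold values associated with the different classifications and put them into separate cells of a matrix, where M1 would be [1,1], t(8;21) would be [2,1], pos would be [3,1], and so on. I'm not sure where to start, particularly because some values are missing (e.g. FAB=, Karyotype=, etc.), and because the values in the Karyotype field are sometimes quoted as in the example above and other times numeric, possibly with special characters, e.g. -7.

Any suggestions would be greatly appreciated.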

2 answers:

Answer 0: (score: 1)

Here is a fairly direct approach, demonstrated on two copies of the input string.

input = 'Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg'
input = rep(input, 2)

#remove everything up through "purification. "
result = sub(pattern = ".*purification\\. ", replacement = "", x = input)
# split by commas:
result = strsplit(result, split = ", ")
# delete everything through "="
result = lapply(result, sub, pattern = ".*=", replacement = "")

do.call(rbind, result)
#      [,1] [,2]          [,3]  [,4]  [,5]  [,6]  [,7]  [,8] 
# [1,] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"
# [2,] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"

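If you want to switch the rows and columns, use cbind instead of rbind in the last step. This should work whether or not Karyotype is quoted. It also handles missing data, filling in an empty string, as long as the label is still present for the missing value (e.g. "FAB=, Karyotype..."). You may want to replace the empty strings with NA as an additional step.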
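The additional steps mentioned above (cbind, dropping the quotes, and turning empty strings into NA) might look like the following. This is only a sketch: the two input strings are shortened, hypothetical samples, and it assumes every sample lists the same labels in the same order.

```r
# Two hypothetical samples; the second has empty values and an unquoted karyotype
input <- c(
  'Shared description ending in gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg',
  'Shared description ending in gradient purification. FAB=, Karyotype=-7, FLT3 ITD=pos, FLT3 TKD=, N-RAS=, K-RAS=, EVI1=, cEBPa='
)

# Drop the shared description, then split each sample into "label=value" fields
fields <- strsplit(sub(".*purification\\. ", "", input), ", ", fixed = TRUE)

# Keep only what follows "=" in each field
values <- lapply(fields, sub, pattern = ".*=", replacement = "")

# One column per sample, one row per label
m <- do.call(cbind, values)

# Strip the double quotes around quoted karyotypes (m[] keeps the matrix shape)
m[] <- gsub('"', "", m, fixed = TRUE)

# Mark missing values as NA instead of empty strings
m[m == ""] <- NA

# Label the rows using the field names of the first sample
rownames(m) <- sub("=.*", "", fields[[1]])

m
```

With this layout, m[1, 1] holds the FAB value of the first sample and m[2, 1] its karyotype, which matches the indexing described in the question.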

Answer 1: (score: 0)

Another approach is to use regular expressions, for example:

string <- 'Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg'

1.- First, split the string on commas:

string <- strsplit(string, ',')[[1]]

## You will have a vector like this:

## [1] "Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1"
## [2] " Karyotype=\"t(8;21)\""                                                                                                                                                                                                                                                                     
## [3] " FLT3 ITD=pos"                                                                                                                                                                                                                                                                              
## [4] " FLT3 TKD=pos"                                                                                                                                                                                                                                                                              
## [5] " N-RAS=neg"                                                                                                                                                                                                                                                                                 
## [6] " K-RAS=neg"                                                                                                                                                                                                                                                                                 
## [7] " EVI1=neg"                                                                                                                                                                                                                                                                                  
## [8] " cEBPa=neg" 

2.- Use gsub with a regular expression to remove everything up to and including the = sign:

gsub(".*=","",string)

## The result is a vector with "clean" data.

## [1] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg" 

The pattern .*= matches every character up to and including the = sign, and gsub replaces that match with nothing ("").
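One detail worth knowing about this pattern is that .* is greedy, so it removes everything through the last = in each element, not the first. The small illustration below uses a made-up string, x, that is not part of the original data:

```r
# Hypothetical string containing two "=" signs
x <- "a=b=c"

# Greedy: .* extends as far as possible, so the match ends at the last "="
greedy <- sub(".*=", "", x)

# Lazy: .*? stops at the first "=" (perl = TRUE selects PCRE syntax)
lazy <- sub(".*?=", "", x, perl = TRUE)

# gsub gives the same result as sub here, because one greedy match
# already consumes everything up to the last "="
greedy_all <- gsub(".*=", "", x)

print(c(greedy, lazy, greedy_all))
```

For the sample data this makes no difference, since each field contains a single =, but it would matter if a value itself could contain an = sign.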

For more information on regular expressions in R, see this link; it helped me a lot.

Hope this helps.

Regards