Question

In one field of a table, I have a string like the following for each of several hundred samples:

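Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg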

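The descriptive sentences at the start (italicized in the original post) are identical for every sample, so I would like to remove them. From the last sentence, I want to extract each value that follows a classification label (bolded in the original post) and put the values into separate cells of a matrix, where M1 would be [1,1], t(8;21) would be [2,1], pos would be [3,1], and so on. I am not sure where to start, especially because some values are missing (e.g., FAB=, Karyotype=, etc.), and because the Karyotype value is sometimes wrapped in quotes, as in the example above, and at other times is numeric and may contain special characters, such as -7.

Any advice would be greatly appreciated.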

Answer 1

Here is a very straightforward approach, demonstrated on two copies of the input string.

input = 'Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg'
input = rep(input, 2)

#remove everything up through "purification. "
result = sub(pattern = ".*purification\\. ", replacement = "", x = input)
# split by commas:
result = strsplit(result, split = ", ")
# delete everything through "="
result = lapply(result, sub, pattern = ".*=", replacement = "")

do.call(rbind, result)
#      [,1] [,2]          [,3]  [,4]  [,5]  [,6]  [,7]  [,8] 
# [1,] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"
# [2,] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"

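If you want to switch rows and columns, use cbind instead of rbind in the last step. This should work whether or not the Karyotype value is quoted. It also handles missing data, filling in an empty string, as long as the label for the missing value is still present (e.g., "FAB=, Karyotype..."). You may want to replace the empty strings with NA as an additional step.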
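The optional clean-up mentioned above could look something like the sketch below. The two shortened input strings, and the names `values` and `mat`, are made up for illustration and are not from the original post; the second string has an empty FAB value and an unquoted Karyotype of -7.

```r
# Two shortened, hypothetical sample strings (not the real data).
input <- c(
  'Shared text ending in gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos',
  'Shared text ending in gradient purification. FAB=, Karyotype=-7, FLT3 ITD=neg'
)

# Same three steps as in the answer above.
values <- sub(".*purification\\. ", "", input)
values <- strsplit(values, ", ", fixed = TRUE)
values <- lapply(values, sub, pattern = ".*=", replacement = "")
mat <- do.call(rbind, values)

# Strip the literal double quotes around quoted Karyotype values.
# Assigning into mat[] keeps the matrix dimensions.
mat[] <- gsub('"', "", mat, fixed = TRUE)

# Turn empty strings (label present, value missing) into NA.
mat[mat == ""] <- NA

mat
```

Using `t(mat)` afterwards would give one column per sample, which matches the [1,1], [2,1], [3,1] layout described in the question.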

Answer 2

Another approach is to use regular expressions, for example:

string <- 'Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1, Karyotype="t(8;21)", FLT3 ITD=pos, FLT3 TKD=pos, N-RAS=neg, K-RAS=neg, EVI1=neg, cEBPa=neg'

1.- First, split the string on "," with strsplit:

string <- strsplit(string, ',')[[1]]

## You will have a vector like this:

## [1] "Blasts and mononuclear cells were purified from bone marrow or peripheral blood aspirates of acute myeloid patients. Samples contained 80-100 percent blast cells. Total RNA was extracted by lyses with guanidium isothiocyanate followed by cesium chloride gradient purification. FAB=M1"
## [2] " Karyotype=\"t(8;21)\""                                                                                                                                                                                                                                                                     
## [3] " FLT3 ITD=pos"                                                                                                                                                                                                                                                                              
## [4] " FLT3 TKD=pos"                                                                                                                                                                                                                                                                              
## [5] " N-RAS=neg"                                                                                                                                                                                                                                                                                 
## [6] " K-RAS=neg"                                                                                                                                                                                                                                                                                 
## [7] " EVI1=neg"                                                                                                                                                                                                                                                                                  
## [8] " cEBPa=neg"

2.- Use gsub with a regular expression to remove every character up to and including the =:

gsub(".*=","",string)

## The result is a vector with "clean" data.

## [1] "M1" "\"t(8;21)\"" "pos" "pos" "neg" "neg" "neg" "neg"

The pattern .*= matches every character up to and including the =, and gsub replaces that match with an empty string ("").
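As a small extension of the same idea, the labels can be kept alongside the values by applying the mirror-image pattern. This is only a sketch: the shortened `fields` vector and the names `keys` and `vals` are assumptions for illustration, not part of the original answer.

```r
# A shortened, hypothetical result of the split step, including one missing value.
fields <- c("FAB=M1", " Karyotype=\"t(8;21)\"", " FLT3 ITD=pos", " N-RAS=")

# Everything from the "=" onward is dropped to get the label;
# trimws() removes the leading space left over from splitting on ",".
keys <- trimws(sub("=.*", "", fields))

# Everything up to and including the "=" is dropped to get the value.
vals <- sub(".*=", "", fields)

# A named character vector: labels as names, values as elements.
setNames(vals, keys)

# Note that ".*" is greedy, so ".*=" removes text up to the LAST "=".
# If a value could itself contain "=", "^[^=]*=" stops at the first one.
sub(".*=", "", "a=b=c")      # "c"
sub("^[^=]*=", "", "a=b=c")  # "b=c"
```

A missing value such as "N-RAS=" comes out as an empty string, which can then be converted to NA if needed.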

For more information on regular expressions in R, see this link; it helped me a lot.

Hope this helps.

Credits

How to extract multiple values from the same string and put them into a matrix in R
