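I am using R to process a data.frame. One column contains a mix of letters and digits, and I want to insert a comma between two occurrences of a character pattern:
Input: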
arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3
arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat
arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat
Expected output:
arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3
arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat
arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat
Basically, I want to add a comma after the first (xxx-xxx)x1 (the suffix may be x1, x2, or x3, and it may be followed by " mat" or " pat").
Many thanks to MichaelChirico and Onyambu. I pulled some more examples from that column.
Input:
'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3'
'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat'
'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
'arr [hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish'
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)'
'nuc ish(D21S259 / D21S341 / D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
Output:
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3'
'arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat'
'arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat'
'arr [hg19] Xp22.33p22.12(60701-21536551)x1~3,Xq21.31q28(90731177-155208244)x1 ish'
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)'
'nuc ish(D21S259 / D21S341 / D21S342x3).arr(21)x3,10q26.12(121812494-122486677)x1'
I tried the following code, but it does not work for all cases:
x <- c(
'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
'arr [hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
'nuc ish(D21S259 / D21S341 / D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
)
sub(pattern = '([)]x[1|2|3|1~2|1~3]\\s[mat|pat|dn]?)', replacement = '\\1,', x = x)
Answer 0 (score: 0)
You can do the following:
x <- c(
'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)
sub(pattern = "([(][0-9]+-[0-9]+[)]x[0-9])([^[:space:]].*)", replacement = "\\1,\\2", x = x)
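Here is a brief explanation:
1) The regular expression that matches (xxx-xxx)x1 is [(][0-9]+-[0-9]+[)]x[0-9]. Here I use [] instead of escaping to match a literal (. The rest reads as: digits [0-9]+, followed by -, followed by digits [0-9]+, followed by ), then x, then a single digit [0-9].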
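2) Capture groups are then used to split the string and join it back together. The string is split at a non-whitespace character followed by any number of characters, ([^[:space:]].*), so that the pattern from 1) lands in the first group and the remainder in the second. The two groups are then concatenated with a , between them, i.e. "\\1,\\2".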
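The two steps above can be checked on the three example strings from the question. The sketch below is only an illustration: `answer_pat` is the pattern from this answer, while `ext_pat` is a hypothetical variant (not part of the original answer) that assumes an optional " mat", " pat", or " dn" may follow the copy number and that the next segment starts with a digit, X, or Y. It is not meant to cover the additional inputs listed in the question.

```r
x <- c(
  'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
  'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
  'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)

# Step 1: "[(]" (character class) and "\\(" (escape) both match a literal "(".
same <- identical(grepl("[(]", x), grepl("\\(", x))

# Step 2: pattern from the answer; group 1 is "(xxx-xxx)xN",
# group 2 is everything after it, starting at a non-space character.
answer_pat <- "([(][0-9]+-[0-9]+[)]x[0-9])([^[:space:]].*)"
res_answer <- sub(answer_pat, "\\1,\\2", x)

# Hypothetical variant (an assumption, not from the answer): keep an optional
# " mat"/" pat"/" dn" inside group 1, and use a lookahead so the comma is only
# inserted when a new segment (digit, X or Y) follows immediately.
ext_pat <- "([(][0-9]+-[0-9]+[)]x[0-9](?: (?:mat|pat|dn))?)(?=[0-9XY])"
res_ext <- sub(ext_pat, "\\1,", x, perl = TRUE)

print(res_answer)
print(res_ext)
```

With the answer's pattern, the third string comes back unchanged, because a space follows x1 there and the second group must start with a non-space character. The variant moves the optional parental-origin suffix into the first group, which is one way to handle that case; the lookahead requires `perl = TRUE`.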