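Which pattern should I use so that my code recognizes square brackets?

Asked: 2019-07-16 01:37:23

Tags: r gsub mutate

I have a data frame with columns named, in this order, "m.z, Intensity, Relative, Delta, RBD.equiv, and Composition", and each row holds these parameters for a particular molecule. The "Composition" column contains formulas such as "C7 H11 O4". I can split these so that the elements (C, H, O, etc.) become column headers, with the number of each element in the molecular formula underneath. However, when a carbon isotope appears in the composition, as in "C11 [13]C H21 O N3 S2", the code fails and gives me an error. I would like [13]C to get its own column so that it can be distinguished from the other molecules.

My data.frame looks like the one below, but with hundreds of compositions. For reference, the data frame comes from a csv file. I am not sure which pattern to use in gsub so that [13]C becomes a column with the corresponding rows.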

#This is what my data frame looks like, but with more rows

#m.z  Intensity  Relative  Delta.  RBD.equiv  Composition 
#275  7555870    100       -0.49   0.0        C3 [13]C H4 O2
#136  126098     70.67     -2.72   5.5        C7 H11 O4 Na S

    data <- dataframe %>%
      mutate(Composition = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition),
             name = str_extract_all(Composition, "[A-Za-z]+"),
             value = str_extract_all(Composition, "\\d+")) %>%
      unnest() %>%
      spread(name, value, fill = 0)

#I expect to see something like this when I print my results

#m.z Intensity Relative Delta. RBD.equiv Composition   C [13]C H O Na
#275 7555870   100      -0.49  0.0       C3 [13]C H4   3 1     4 0 0
#133 126098    70.67    -2.72  5.5       C7 H5 O4 Na   7 0     5 4 1


1 Answer:

Answer 0 (score: 0)

Edit: I managed to fix the regular expressions in your code:

    library(dplyr); library(tidyr); library(stringr)

    data <- dataframe %>%
      mutate(Composition = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition),
             name = str_extract_all(Composition, "(\\[[0-9]+\\])*[A-Za-z]+"), # allow a number in square brackets before the element
             value = str_extract_all(Composition, "(?<!\\[[0-9]{0,5})[0-9]+")) %>% # only numbers that are not in square brackets (I expect the bracketed number to have at most 5 digits)
      unnest(cols = c(name, value)) %>% # tidyr >= 1.0 syntax; older versions accept a bare unnest()
      spread(name, value, fill = 0)
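To see what each pattern picks up, here is a minimal sketch that applies them to a single composition string taken from the question. The variable names `x`, `elements`, and `counts` are only illustrative, and the example assumes the stringr package is installed (the lookbehind relies on stringr's ICU regex engine).

```r
library(stringr)

# gsub() appends "1" to every element that has no explicit count,
# so the isotope token "[13]C" becomes "[13]C1".
x <- gsub("\\b([A-Za-z]+)\\b", "\\11", "C3 [13]C H4 O2")
x
# "C3 [13]C1 H4 O2"

# Element names, with an optional isotope prefix such as "[13]".
elements <- str_extract_all(x, "(\\[[0-9]+\\])*[A-Za-z]+")[[1]]
elements
# "C" "[13]C" "H" "O"

# Counts: runs of digits that are not inside square brackets.
counts <- str_extract_all(x, "(?<!\\[[0-9]{0,5})[0-9]+")[[1]]
counts
# "3" "1" "4" "2"
```

Both vectors have the same length and order, which is what lets `name` and `value` be unnested side by side.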

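My first solution was to separate the elements of the molecular formula first and then apply a regex to each element:

(Note that I am used to the splitstackshape package for splitting, so I used it in this solution. If you are familiar with another approach, you can swap it in.)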

    data <- dataframe %>%
      mutate(CompositionCopy = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition)) %>% # your code
      # split into one row per element
      splitstackshape::cSplit("CompositionCopy", " ", fixed = TRUE, direction = "long", type.convert = FALSE) %>%
      # each row now holds a single token, so str_extract() (which returns a plain vector) is enough
      mutate(name = str_extract(CompositionCopy, ".*[A-Za-z]+"),            # added .* to your regex
             value = as.integer(str_extract(CompositionCopy, "\\d+$"))) %>% # added $ to only get the number at the end
      select(-CompositionCopy) %>%
      spread(name, value, fill = 0L)
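If you would rather not depend on splitstackshape, the split step can probably be swapped for `tidyr::separate_rows()`. The following is a sketch under that assumption, not a tested drop-in replacement: the two-row `dataframe` is a cut-down stand-in for the question's data, and the `token` column name is made up for the example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Cut-down stand-in for the question's data (only two of the columns).
dataframe <- data.frame(
  m.z = c(275, 136),
  Composition = c("C3 [13]C H4 O2", "C7 H11 O4 Na S"),
  stringsAsFactors = FALSE
)

result <- dataframe %>%
  # add an explicit count of 1 where an element has none
  mutate(token = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition)) %>%
  # one row per element token, e.g. "C3", "[13]C1", "H4", "O2"
  separate_rows(token, sep = " ") %>%
  # everything up to the last letter is the name; trailing digits are the count
  mutate(name = str_extract(token, ".*[A-Za-z]+"),
         value = as.integer(str_extract(token, "\\d+$"))) %>%
  select(-token) %>%
  spread(name, value, fill = 0L)

result
```

Because each row holds exactly one token after the split, the bracketed isotope number never has to be told apart from the count with a lookbehind; anchoring the count with `$` is enough.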