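I have a data frame with columns named, in this order, "m.z, Intensity, Relative, Delta, RBD.equiv and Composition", and each row is filled with those parameters for a particular molecule. The "Composition" column contains codes such as "C7 H11 O4". I can split the elements ("C, H, O, etc.") out into column headers, with the number of atoms of each element from the molecular formula underneath. However, when a carbon isotope appears in the composition, as in "C11 [13]C H21 O N3 S2", the code fails and gives me an error. I would like [13]C to get its own column so that it can be distinguished from the other molecules.
My data.frame looks like the one below, but with hundreds of compositions. For reference, the data frame comes from a csv file. I am not sure which pattern to use in gsub so that [13]C becomes a column with the corresponding rows.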
#This is how my data frame looks like but with more rows
#m.z Intensity Relative Delta. RBD.equiv Composition
#275 7555870 100 -0.49 0.0 C3 [13]C H4 O2
#136 126098 70.67 -2.72 5.5 C7 H11 O4 Na S
data <- dataframe %>%
  mutate(Composition = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition),
         name = str_extract_all(Composition, "[A-Za-z]+"),
         value = str_extract_all(Composition, "\\d+")) %>%
  unnest() %>%
  spread(name, value, fill = 0)
#I expect to see something like this when I print my results
#m.z Intensity Relative Delta. RBD.equiv Composition C [13]C H O Na
#275 7555870 100 -0.49 0.0 C3 [13]C H4 3 1 4 0 0
#133 126098 70.67 -2.72 5.5 C7 H5 O4 Na 7 0 5 4 1
Answer 0 (score: 0)
Edit: I managed to fix the regular expressions in your code:
data <- dataframe %>%
  mutate(Composition = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition),
         name = str_extract_all(Composition, "(\\[[0-9]+\\])*[A-Za-z]+"), # allow a number in square brackets before the element
         value = str_extract_all(Composition, "(?<!\\[[0-9]{0,5})[0-9]+")) %>% # only numbers that are not in square brackets (I expect the bracketed number to have 5 digits at most)
  unnest() %>%
  spread(name, value, fill = 0)
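To see what the two patterns pick up, here is a minimal sketch that runs them on the two sample formulas from the question, outside of any data frame. The variable names (`comp`, `comp1`, `names_found`, `counts_found`) are made up for illustration, and the sketch assumes the stringr package is installed.

```r
library(stringr)

# Sample formulas taken from the question
comp <- c("C3 [13]C H4 O2", "C7 H11 O4 Na S")

# Step 1: append an explicit count of 1 to every symbol that has no number
comp1 <- gsub("\\b([A-Za-z]+)\\b", "\\11", comp)

# Step 2: element names, optionally prefixed by an isotope label such as [13]
names_found <- str_extract_all(comp1, "(\\[[0-9]+\\])*[A-Za-z]+")

# Step 3: counts, skipping any digits that sit inside square brackets
counts_found <- str_extract_all(comp1, "(?<!\\[[0-9]{0,5})[0-9]+")

print(comp1)
print(names_found[[1]])
print(counts_found[[1]])
```

For each formula the names and counts should come out with the same length and in the same order, which is what lets `unnest()` pair them up row by row.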
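My first solution was to split the molecular formula into its elements first and then apply the regular expressions to each element:
(Note that I used the splitstackshape package for the split because I am used to it. If you are more familiar with another approach, you can swap it in.)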
data <- dataframe %>%
  mutate(CompositionCopy = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition)) %>% # your code
  splitstackshape::cSplit("CompositionCopy", " ", fixed = TRUE, direction = "long", type.convert = FALSE) %>% # split into one element per row
  mutate(name = str_extract(CompositionCopy, ".*[A-Za-z]+"), # added .* to your regex to keep the [13] prefix; str_extract returns a plain character vector, as each row now holds one element
         value = as.integer(str_extract(CompositionCopy, "\\d+$"))) %>% # added $ to your regex to only get the number at the end
  select(-CompositionCopy) %>%
  spread(name, value, fill = 0L)
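If you would rather not depend on splitstackshape, the same split-then-extract idea can be sketched with tidyr alone. This is only an illustration, not a tested drop-in replacement: the two-row `df` is a made-up stand-in for the real data, the `token` column is a hypothetical helper, and `separate_rows()` and `pivot_wider()` stand in for `cSplit()` and `spread()`. It assumes a reasonably recent tidyr (1.1 or later for the scalar `values_fill`). A missing count is filled in with `coalesce()` instead of the `gsub` step.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical two-row sample mirroring the question's data
df <- data.frame(
  m.z = c(275, 136),
  Composition = c("C3 [13]C H4 O2", "C7 H11 O4 Na S"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(token = Composition) %>%          # keep the original formula intact
  separate_rows(token, sep = " ") %>%      # one element per row
  mutate(
    # element symbol, with an optional isotope label such as [13] in front
    name = str_extract(token, "^(\\[[0-9]+\\])?[A-Za-z]+"),
    # trailing count; NA when the symbol has no number
    value = as.integer(str_extract(token, "[0-9]+$")),
    # a symbol without a number means exactly one atom
    value = coalesce(value, 1L)
  ) %>%
  select(-token) %>%
  pivot_wider(names_from = name, values_from = value, values_fill = 0L)

print(wide)
```

Because `[13]C` is not a syntactic R name, the resulting column has to be accessed with backticks or double brackets, for example `wide[["[13]C"]]`.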