Splitting a two-letter string in the 4th column

Time: 2018-02-05 23:19:54

Tags: r data-science tidyr

I have a data frame, df, containing genomic data. The last column holds two-letter variants.

I would like to split the last column into two columns, with one letter in each:

               id crm     pos allele
160841  rs2237282  11 1273948     AG
160842  rs6417577  11 1276796     AC
165677  rs2151342  11 1199626     GT
165678  rs2749240  11 1258025     AG

I tried the following in RStudio 1.1.419 with R 3.4.3, using dplyr and tidyr, without success:

  • separate(df, allele, into = c("allele", "allele2"))
  • separate(df, allele, into = c("allele", "allele2"), sep = "")
  • separate(df, allele, into = c("allele", "allele2"), sep = "\c")
  • separate(df, allele, into = c("allele", "allele2"), sep = ".")
  • separate(df, allele, into = c("allele", "allele2"), sep = .)
  • separate(df, allele, into = c("allele", "allele2"), sep = \c)

How can I get the desired split?

4 answers:

Answer 0 (score: 6)

Using base R:

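# Prototype data frame: names and types of the columns to create
proto <- data.frame(A1 = character(), A2 = character())
# Each "(.)" captures one character; bind the captured columns to df
cbind(df, strcapture("(.)(.)", df$allele, proto))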
              id crm     pos allele A1 A2
160841 rs2237282  11 1273948     AG  A  G
160842 rs6417577  11 1276796     AC  A  C
165677 rs2151342  11 1199626     GT  G  T
165678 rs2749240  11 1258025     AG  A  G

Answer 1 (score: 5)

In separate, the sep argument can be a number, giving the character position at which to split:

separate(df, allele, into = c("allele1", "allele2"), sep = 1)

which gives:

              id crm     pos allele1 allele2
160841 rs2237282  11 1273948       A       G
160842 rs6417577  11 1276796       A       C
165677 rs2151342  11 1199626       G       T
165678 rs2749240  11 1258025       A       G
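A minimal runnable sketch of the positional split, assuming tidyr is installed. Only the id and allele columns from the question are rebuilt here to keep it short; the name split_df is introduced purely for illustration.

```r
library(tidyr)

# Two columns from the question's table, typed in by hand
df <- data.frame(
  id = c("rs2237282", "rs6417577", "rs2151342", "rs2749240"),
  allele = c("AG", "AC", "GT", "AG"),
  stringsAsFactors = FALSE
)

# A numeric sep is a character position, not a pattern:
# sep = 1 cuts each string after its first character
split_df <- separate(df, allele, into = c("allele1", "allele2"), sep = 1)
print(split_df)
```

By default separate drops the input column, so allele is replaced by allele1 and allele2; passing remove = FALSE keeps it.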

Answer 2 (score: 1)

library(tidyverse)

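df %>%
    # Take the second letter first, while allele still holds both letters
    mutate(allele2 = substr(allele, 2, 2)) %>%
    # Then overwrite allele with its first letter
    mutate(allele = substr(allele, 1, 1))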

Answer 3 (score: 0)

Besides separate, another option from the tidyr package is extract. It works by specifying capture groups in the regex argument.

library(tidyr)

df %>%
  extract(allele, into = c("allele1", "allele2"), regex = "([ATCG])([ATCG])")
#               id crm     pos allele1 allele2
# 160841 rs2237282  11 1273948       A       G
# 160842 rs6417577  11 1276796       A       C
# 165677 rs2151342  11 1199626       G       T
# 165678 rs2749240  11 1258025       A       G
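
One property of this pattern worth knowing: because the character class is restricted to A, T, C and G, values that do not match come back as NA rather than being split. A small sketch, assuming tidyr is installed; the toy data frame, its ids, and the "N-" value are made up for illustration and do not come from the question.

```r
library(tidyr)

# Hypothetical toy data: the last value is not a pair of A/T/C/G letters
toy <- data.frame(
  id = c("rs1", "rs2", "rs3"),
  allele = c("AG", "GT", "N-"),
  stringsAsFactors = FALSE
)

# Each parenthesised group fills one of the columns named in `into`;
# rows where the regex does not match get NA in both new columns
out <- tidyr::extract(toy, allele, into = c("allele1", "allele2"),
                      regex = "([ATCG])([ATCG])")
print(out)
```

Replacing the character classes with "(.)(.)" would accept any two characters, at the cost of losing this check.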