Reformatting text into a table in R

Date: 2020-07-01 11:05:55

Tags: r string text data-structures reshape

I would like to ask the community for help with reshaping a text file. The file looks like this:

TRINITY_GG_17866_c6_g1_i1
TRINITY_GG_17866_c3_g1_i1
TRINITY_GG_17866_c1_g1_i7
GO:0000226
GO:0006139
GO:0006259
TRINITY_GG_17866_c5_g1_i1
GO:0003674
GO:0005488

What I would like to end up with is this (tab-separated):

TRINITY_GG_17866_c1_g1_i7 GO:0000226
TRINITY_GG_17866_c1_g1_i7 GO:0006139
TRINITY_GG_17866_c1_g1_i7 GO:0006259
TRINITY_GG_17866_c5_g1_i1 GO:0003674
TRINITY_GG_17866_c5_g1_i1 GO:0005488

So far I have not been able to come up with a solution. I would be grateful for any help with this problem.

Best wishes, Ferenc

1 answer:

Answer 0 (score: 1)

One dplyr option could be:

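library(dplyr)

df %>%
 # start a new group at every line that is not a GO term
 group_by(grp = cumsum(!startsWith(V1, "GO:"))) %>%
 # drop IDs that have no GO terms after them
 filter(n() > 1) %>%
 # pair each line with the next one, then repeat the group's ID in V1
 mutate(V2 = lead(V1),
        V1 = first(V1)) %>%
 # the last row of each group has no following line, so its V2 is NA
 na.omit() %>%
 ungroup() %>%
 select(-grp)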

  V1                        V2        
  <chr>                     <chr>     
1 TRINITY_GG_17866_c1_g1_i7 GO:0000226
2 TRINITY_GG_17866_c1_g1_i7 GO:0006139
3 TRINITY_GG_17866_c1_g1_i7 GO:0006259
4 TRINITY_GG_17866_c5_g1_i1 GO:0003674
5 TRINITY_GG_17866_c5_g1_i1 GO:0005488
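
To see why the grouping step works, it may help to look at the group numbers on their own. The following is a minimal base R sketch, not part of the answer above; the names `x`, `is_id` and `grp_sizes` are introduced here only for illustration.

```r
# Same lines as the sample data, kept as a plain character vector
x <- c("TRINITY_GG_17866_c6_g1_i1",
       "TRINITY_GG_17866_c3_g1_i1",
       "TRINITY_GG_17866_c1_g1_i7",
       "GO:0000226", "GO:0006139", "GO:0006259",
       "TRINITY_GG_17866_c5_g1_i1",
       "GO:0003674", "GO:0005488")

# TRUE for every line that is not a GO term, i.e. a transcript ID
is_id <- !startsWith(x, "GO:")

# Each TRUE bumps the running count, so an ID and the GO terms
# that follow it end up with the same group number
grp <- cumsum(is_id)
grp
# 1 2 3 3 3 3 4 4 4

# Groups of size 1 are IDs without GO terms; filter(n() > 1) drops them
grp_sizes <- table(grp)
grp_sizes
```

Groups 1 and 2 contain only an ID, which is why the first two transcripts do not appear in the result.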

Or as a single column:

df %>%
 group_by(grp = cumsum(!startsWith(V1, "GO:"))) %>%
 filter(n() > 1) %>%
 mutate(V2 = lead(V1),
        V1 = first(V1)) %>%
 na.omit() %>%
 ungroup() %>%
 select(-grp) %>%
 # join both columns into one string, separated by a space (the paste() default)
 transmute(V1 = paste(V1, V2))

  V1                                  
  <chr>                               
1 TRINITY_GG_17866_c1_g1_i7 GO:0000226
2 TRINITY_GG_17866_c1_g1_i7 GO:0006139
3 TRINITY_GG_17866_c1_g1_i7 GO:0006259
4 TRINITY_GG_17866_c5_g1_i1 GO:0003674
5 TRINITY_GG_17866_c5_g1_i1 GO:0005488
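
Since the question asks for tab-separated output, one way to get there is to write the two-column result with `write.table()`. This is only a sketch: `res` is typed in from the two-column output shown earlier so that the block runs on its own, and `out_file` is a placeholder path, not something from the original post.

```r
# Two-column result, rebuilt by hand from the output shown earlier
res <- data.frame(
  V1 = rep(c("TRINITY_GG_17866_c1_g1_i7", "TRINITY_GG_17866_c5_g1_i1"),
           times = c(3, 2)),
  V2 = c("GO:0000226", "GO:0006139", "GO:0006259",
         "GO:0003674", "GO:0005488"),
  stringsAsFactors = FALSE
)

# Placeholder output path; replace with a real file name
out_file <- tempfile(fileext = ".tsv")

# sep = "\t" gives tab-separated columns; quotes, row names
# and the header line are switched off
write.table(res, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to check what was written
written <- readLines(out_file)
written
```

If a single tab-joined column is preferred inside R, `paste(V1, V2, sep = "\t")` in the `transmute()` call should do the same job.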

Sample data:

df <- read.table(text = "TRINITY_GG_17866_c6_g1_i1
TRINITY_GG_17866_c3_g1_i1
TRINITY_GG_17866_c1_g1_i7
GO:0000226
GO:0006139
GO:0006259
TRINITY_GG_17866_c5_g1_i1
GO:0003674
GO:0005488",
                 header = FALSE,
                 stringsAsFactors = FALSE)
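
For comparison, the same reshaping can be sketched in base R without dplyr. This is an alternative, not the method from the answer above. It assumes the first line of the input is a transcript ID and that every GO line starts with "GO:"; `txt` stands in for the result of `readLines()` on the real file, and the names `is_id`, `current_id` and `res` are hypothetical.

```r
# Stand-in for readLines() on the real file
txt <- c("TRINITY_GG_17866_c6_g1_i1",
         "TRINITY_GG_17866_c3_g1_i1",
         "TRINITY_GG_17866_c1_g1_i7",
         "GO:0000226", "GO:0006139", "GO:0006259",
         "TRINITY_GG_17866_c5_g1_i1",
         "GO:0003674", "GO:0005488")

# TRUE for transcript IDs, FALSE for GO terms
is_id <- !startsWith(txt, "GO:")

# For every line, the most recent transcript ID seen so far
# (assumes the first line is an ID, so the index is never 0)
current_id <- txt[is_id][cumsum(is_id)]

# Keep only the GO lines, each paired with its transcript ID
res <- data.frame(V1 = current_id[!is_id],
                  V2 = txt[!is_id],
                  stringsAsFactors = FALSE)
res
```

IDs that are not followed by any GO term drop out automatically, because only GO lines become rows.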