Mutate for a nested column results in unsupported class (data.frame)

Time: 2019-02-01 04:41:51

Tags: r dplyr nested mutate tibble

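I'm currently doing a motif search in a series of for loops and would like to move to nested tibbles for speed and simplicity (ish). However, I can't figure out how to store a tibble inside a tibble so that it can be unnested. If that isn't possible, I'd appreciate tips on how to pass back a list (along with an ID column) so that it can later be joined to the original table.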

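Input: a set of coordinates and the corresponding DNA sequences

Goals:
1) Find instances of a motif I care about
2) Combine them with the start or end of the range to create all start and end pairs (where a found position can be either one)
3) Determine the type of each pair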

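I can't figure out how to get mutate to accept a tibble (Error in mutate_impl(.data, dots): Column `pairs` is of unsupported class data.frame). I can't call this rowwise here because I need to send the whole list of positions, along with values from the other columns, to the function.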

library(tidyverse)  # tibble, dplyr, tidyr and purrr are all used below

test_input = tibble(
  start = c(1,10,15), 
  end = c(9, 14, 25),  
  sequence = c("GAGAGAGTC","CATTT", "TCACAGTTTCC")
)

custom_function = function(start, end, list.of.positions) {
  ## Doesn't include extra math, case specifications, and error handling here for simplicity
  starts = c(start, list.of.positions)
  ends = c(end, list.of.positions)
  pairs = expand.grid(starts, ends) %>% as_tibble %>% 
    mutate(type = case_when(TRUE ~ "a_type")) #Simplified for example to one case 
  return(pairs)
}

test_input %>% 
# for each set of coordinates/string
  rowwise() %>% 
  # find the positions of a given motif
  # (regexp.match.ends() is not defined in this post; presumably the asker's own helper)
  mutate(match.positions = regexp.match.ends(gregexpr("AG", sequence))) %>% 
  mutate(num.matches = case_when(
    is_logical(match.positions) ~ NA_integer_,
    TRUE ~ length(match.positions) 
  )) %>% 
  # expand and convert to real positions
  unnest %>% rowwise %>% 
  mutate(true.positions = case_when(
    is.na(match.positions) ~ NA_real_, #must be a double-compatible NA
    TRUE ~ start + match.positions - 1)) %>% 
  select(-match.positions) %>% 
  ungroup() %>% 
  # re-"nest" into a list of real positions
  group_by_at(vars(-true.positions)) %>% 
  summarise(true.positions = list(true.positions)) %>% 
  # pass list of real positions to a function that creates pairs of coordinates and determines the type of pair
  mutate(pairs = custom_function(start, end, true.positions))

My final tibble should look like this (after unnesting the pairs):

  start   end  sequence      new.start  new.end   type  
  <dbl> <dbl>  <chr>         <dbl>      <dbl>    <chr>   
1     1     9  GAGAGAGTC     1          3        a_type
1     1     9  GAGAGAGTC     1          5        a_type
2     1     9  GAGAGAGTC     1          7        a_type
3     1     9  GAGAGAGTC     1          9        a_type
4     1     9  GAGAGAGTC     3          5        a_type
...
10    1     9  GAGAGAGTC     7          9        a_type
11    10    14 CATTT         10         14       a_type
...

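One workaround I've thought of is to paste the output values into a string, pass that back as a list (which mutate tolerates), unnest it, and then separate it again, but surely there's a less hacky way to do this. Thanks very much for any help/ideas!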

1 answer:

Answer 0: (score: 0)

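So, I'm completely unfamiliar with motifs. But I think I can piece together what you're trying to do. I like using the stringr package because it gets a lot done with simple syntax.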

library(tibble)
library(dplyr)
library(stringr)

test_input <- tibble(
  start = c(1,10,15), 
  end = c(9, 14, 25),  
  sequence = c("GAGAGAGTC","CATTT", "TCACAGTTTCC")
)

custom_function <- function(string, pattern, label) {
    string %>%
        str_locate_all(pattern) %>%    # get the start-end pairs (a list holding one matrix).
        as.data.frame() %>%    # make it a data.frame with columns start and end
        expand.grid() %>%    # all combos. this seemed important.
        mutate(
            sequence = string,
            type = label
            ) %>%    # add the string and label to each row.
        rename(
            new_start = start,    # rename so we don't confuse columns.
            new_end = end         # I prefer not to use dots in my names.
            ) %>%
        left_join(test_input, by = "sequence")    # add the original start and ends
    # the pipeline result is returned implicitly;
    # the df has cols: start, end, sequence, new_start, new_end, type.
}

final_out <- data.frame(
    start = numeric(0),
    end = numeric(0),
    sequence = character(0),
    new_start = numeric(0),
    new_end = numeric(0)
    )    # empty dummy DF that we'll add to.

for (string in test_input$sequence) {
    final_out <- custom_function(string = string,
                                 pattern = 'AG',
                                 label = 'a_type') %>%
        bind_rows(final_out)
}    # add the rows of each output to the final DF we made.

print(final_out)

It looks like you're trying to label the results according to the pattern supplied, so you can specify 'a_type' or whatever label you want.

There's probably a way to do this without the for loop by using a map or apply function. I'd have to tinker around to figure that out.
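As a rough sketch of that loop-free idea (not something tested against the real data), base `lapply` can build one data frame per sequence and `dplyr::bind_rows` can stack them. The helper name `locate_pairs` is made up for this illustration, and the positions it returns are relative to each string, not the true coordinates from the question.

```r
library(tibble)
library(dplyr)
library(stringr)

test_input <- tibble(
  start = c(1, 10, 15),
  end = c(9, 14, 25),
  sequence = c("GAGAGAGTC", "CATTT", "TCACAGTTTCC")
)

# locate_pairs() is a hypothetical helper for this sketch only.
# It returns every (match start, match end) combination for one string.
locate_pairs <- function(string, pattern, label) {
  # str_locate_all() returns a list with one start/end matrix per input string
  hits <- as.data.frame(str_locate_all(string, pattern)[[1]])
  pairs <- expand.grid(new_start = hits$start, new_end = hits$end)
  # rep() keeps the assignment valid when there are zero matches
  pairs$sequence <- rep(string, nrow(pairs))
  pairs$type <- rep(label, nrow(pairs))
  pairs
}

# One data frame per sequence, stacked, then joined back to the input
all_pairs <- lapply(test_input$sequence, locate_pairs,
                    pattern = "AG", label = "a_type") %>%
  bind_rows() %>%
  left_join(test_input, by = "sequence")

print(all_pairs)
```

Sequences with no match (such as "CATTT") contribute zero rows here. Getting true coordinates would presumably mean adding `start - 1` to the new columns, as the question's own code does.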

Hope that helps, or at least points you in the right direction. Like I said, I'm unfamiliar with motifs.
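On the error in the question itself: the dplyr version that produced that message does not accept a bare data frame as a new column, but a list column whose elements are data frames is fine, and it can be unnested afterwards (newer dplyr versions may behave differently). Below is a minimal sketch of that idea. `make_pairs` is a hypothetical stand-in for the question's custom_function, and the small `nested` table is invented example data, not the asker's real output.

```r
library(tibble)
library(dplyr)
library(tidyr)

# make_pairs() is a hypothetical stand-in for the question's custom_function
make_pairs <- function(start, end, positions) {
  expand.grid(new.start = c(start, positions),
              new.end = c(end, positions)) %>%
    as_tibble() %>%
    mutate(type = "a_type")
}

# Invented example: one row per range, with a list column of motif positions
nested <- tibble(
  start = c(1, 10),
  end = c(9, 14),
  true.positions = list(c(3, 5, 7), numeric(0))
)

result <- nested %>%
  # Map() calls make_pairs() once per row and returns a list of tibbles,
  # so `pairs` becomes a list column, not a data frame column
  mutate(pairs = Map(make_pairs, start, end, true.positions)) %>%
  select(-true.positions) %>%
  unnest(pairs)

print(result)
```

Each call receives the whole vector of positions for its row together with that row's start and end, so no rowwise() is needed. Filtering to valid pairs (for example new.start < new.end) is left out, as in the question.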