Question

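I'm currently doing motif searches in a series of for loops, and would like to move to nested tibbles for speed and simplicity (ish). However, I can't figure out how to store a tibble inside a tibble so that it can be nested. If that isn't possible, I'd appreciate tips on how to pass back a list (along with an ID column) so that it can later be joined to the original table.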

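Input: a set of coordinates and the corresponding DNA sequences

Goals:
1) Find instances of a motif I care about
2) Combine them with the start or end of the range to create all start/end pairs (where a found position can be either one)
3) Determine the type of each pair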
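
As a minimal sketch of goal 1, the end position of each motif match could be computed in base R roughly as follows. `match_ends` is a hypothetical helper introduced here for illustration; it is not the `regexp.match.ends` function used in the code further down, whose definition is not shown.

```r
# Hypothetical helper: end positions (1-based, within the string) of every
# non-overlapping literal match of `motif` in a single sequence.
match_ends <- function(sequence, motif) {
  m <- gregexpr(motif, sequence, fixed = TRUE)[[1]]
  if (m[1] == -1) return(integer(0))   # gregexpr reports "no match" as -1
  as.integer(m + attr(m, "match.length") - 1L)
}

match_ends("GAGAGAGTC", "AG")
match_ends("CATTT", "AG")
```

Adding `start - 1` to these values would shift them from string-relative positions to the coordinate system of the range, which is what the `true.positions` step below does.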

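I can't figure out how to get mutate to accept a tibble (Error in mutate_impl(.data, dots) : Column `pairs` is of unsupported class data.frame). I can't call rowwise here because I need to send the whole list of positions, along with the values in the other columns, to the function.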

test_input = tibble(
  start = c(1,10,15), 
  end = c(9, 14, 25),  
  sequence = c("GAGAGAGTC","CATTT", "TCACAGTTTCC")
)

custom_function = function(start, end, list.of.positions) {
  ## Doesn't include extra math, case specifications, and error handling here for simplicity
  starts = c(start, list.of.positions)
  ends = c(end, list.of.positions)
  pairs = expand.grid(starts, ends) %>% as_tibble %>% 
    mutate(type = case_when(TRUE ~ "a_type")) #Simplified for example to one case 
  return(pairs)
}

test_input %>% 
# for each set of coordinates/string
  rowwise() %>% 
  # find the positions of a given motif
  mutate(match.positions = regexp.match.ends(gregexpr("AG", sequence))) %>% 
  mutate(num.matches = case_when(
    is_logical(match.positions) ~ NA_integer_,
    TRUE ~ length(match.positions) 
  )) %>% 
  # expand and convert to real positions
  unnest %>% rowwise %>% 
  mutate(true.positions = case_when(
    is.na(match.positions) ~ NA_real_, #must be a double-compatible NA
    TRUE ~ start + match.positions - 1)) %>% 
  select(-match.positions) %>% 
  ungroup() %>% 
  # re-"nest" into a list of real positions
  group_by_at(vars(-true.positions)) %>% 
  summarise(true.positions = list(true.positions)) %>% 
  # pass list of real positions to a function that creates pairs of coordinates and determines the type of pair
  mutate(pairs = custom_function(start, end, true.positions))

My final tibble should look like this (after unnesting pairs):

  start   end  sequence      new.start  new.end   type  
  <dbl> <dbl>  <chr>         <dbl>      <dbl>    <chr>   
1     1     9  GAGAGAGTC     1          3        a_type
2     1     9  GAGAGAGTC     1          5        a_type
3     1     9  GAGAGAGTC     1          7        a_type
4     1     9  GAGAGAGTC     1          9        a_type
5     1     9  GAGAGAGTC     3          5        a_type
...
10    1     9  GAGAGAGTC     7          9        a_type
11    10    14 CATTT         10         14       a_type
...

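One workaround I've thought of is pasting the output values into a string and passing that back as a list, which mutate tolerates, then unnesting and separating it. But surely there is a less hacky way to go about this. Thanks very much for any help or ideas!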

Answer 1

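So, I'm not at all familiar with motifs, but I think I can piece together what you're trying to do. I like using the stringr package because it gets a lot done with simple syntax.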

library(dplyr)
library(stringr)
library(tibble)

test_input <- tibble(
  start = c(1,10,15), 
  end = c(9, 14, 25),  
  sequence = c("GAGAGAGTC","CATTT", "TCACAGTTTCC")
)

custom_function <- function(string, pattern, label) {
    string %>%
        str_locate_all(pattern) %>%    # get the start-end pairs.
        as.data.frame() %>%    # make it a data.frame
        expand.grid() %>%    # all combos. this seemed important.
        mutate(
            sequence = string,
            type = label
            ) %>%    # add the string and label to each row.
        rename(
            new_start = start,    # rename so we don't confuse columns.
            new_end = end         # I prefer not to use dots in my names.
            ) %>%
        left_join(test_input, by = "sequence")    # add the original start and ends
        # returned df has cols: new_start, new_end, sequence, type, start, end.
}

final_out <- data.frame(
    start = numeric(0),
    end = numeric(0),
    sequence = character(0),
    new_start = numeric(0),
    new_end = numeric(0)
    )    # empty dummy DF that we'll add to.

for (string in test_input$sequence) {
    final_out <- custom_function(string = string,
                                 pattern = 'AG',
                                 label = 'a_type') %>%
        bind_rows(final_out)
}    # add the rows of each output to the final DF we made.

print(final_out)

It looks like you're trying to label the results based on the pattern supplied, so you can specify 'a_type' or whatever label you want.

It's probably possible to do this without the for loop by using a map or apply function. I'd have to tinker around to figure that out.
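
A rough sketch of what the loop-free version might look like, assuming the purrr package is available. `locate_pairs` is a hypothetical name for a trimmed-down variant of the function above, and the positions it returns are relative to each string rather than to the original coordinates, just as in the loop version.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

test_input <- tibble(
  start = c(1, 10, 15),
  end = c(9, 14, 25),
  sequence = c("GAGAGAGTC", "CATTT", "TCACAGTTTCC")
)

# Hypothetical helper: every combination of match start and match end
# for a single string, labelled with `label`.
locate_pairs <- function(string, pattern, label) {
  hits <- str_locate_all(string, pattern)[[1]]   # matrix with columns "start" and "end"
  expand.grid(new_start = hits[, "start"], new_end = hits[, "end"]) %>%
    mutate(sequence = string, type = label)
}

final_out <- test_input$sequence %>%
  map(function(s) locate_pairs(s, pattern = "AG", label = "a_type")) %>%  # one data frame per string
  bind_rows() %>%                                                         # stack them
  left_join(test_input, by = "sequence")                                  # add the original start and end

print(final_out)
```

Strings without a match contribute zero rows, so they simply drop out of the result.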

Hope that helps, or at least points you in the right direction. Like I said, I'm not familiar with motifs.
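
On the original error itself: a data frame can usually be stored in a tibble column as long as each one is wrapped in a list, which gives a list-column. The sketch below is one way that might look, assuming a recent tidyr (for `unnest(cols = ...)`) and purrr. `make_pairs` and `nested_input` are hypothetical names, the `true.positions` values are assumed to be already computed, and the `new.start < new.end` filter is an assumption made to reproduce the expected output in the question.

```r
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

# Assumed intermediate result: one row per range, positions held in a list-column.
nested_input <- tibble(
  start = c(1, 10),
  end = c(9, 14),
  true.positions = list(c(3, 5, 7), numeric(0))
)

# Hypothetical stand-in for the question's custom_function.
make_pairs <- function(start, end, positions) {
  starts <- c(start, positions)
  ends <- c(end, positions)
  expand.grid(new.start = starts, new.end = ends) %>%
    as_tibble() %>%
    filter(new.start < new.end) %>%   # assumption: keep only forward pairs
    mutate(type = "a_type")
}

result <- nested_input %>%
  # pmap() calls make_pairs() once per row and returns a list of tibbles,
  # which mutate() accepts as a list-column.
  mutate(pairs = pmap(list(start, end, true.positions), make_pairs)) %>%
  select(-true.positions) %>%
  unnest(cols = pairs)

print(result)
```

Because `pmap()` walks the rows itself, the whole vector of positions for a row reaches the function together with that row's `start` and `end`, without needing `rowwise()`.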

Mutate with nested column results in unsupported class (data.frame)
