Finding (L, t)-clumps

Time: 2013-11-20 05:07:36

Tags: algorithm clojure bioinformatics

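I'm trying to find clumps in Clojure. Basically, I need to find all k-length substrings that occur at least t times within some window of size L in a genome. I've implemented what I thought was the solution, but I believe there is a bug, because the system I use for verification (beta.stepic.org) tells me so. Can you find where I messed up? My solution works as follows: find all the top-ranked k-mers (k-length substrings) and find their starting indices. After that, I partition the indices into groups of t, since t is the number of times a k-mer must occur, and check that the difference between the last and first index in each group, plus k, is at most L (all the k-mers should fit in the L-window, and adding k accounts for the length of the last k-mer). The indices are in ascending order. Where is the mistake?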

Clump Finding Problem: Find patterns forming clumps in a string.

 Input: A string Genome, and integers k, L, and t.
 Output: All distinct k-mers forming (L, t)-clumps in Genome.

Sample Input

  

Genome: CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA

 k: 5 
 L: 50 
 t: 4

Sample Output

  

CGACA GAAGA

(defn get-indices
  "Returns the indices for the substring target
   found in source in ascending order. This includes overlaps."
  [source target]
  (let [search  (java.util.regex.Pattern/compile (str "(?=(" target "))"))
        matcher (re-matcher search source)]
    (letfn [(inner [m]
              (when (re-find m)
                (cons (.start m) (inner m))))]
      (inner matcher))))

(defn get-frequent-kmer
  "Gets the most frequent k-mers of size k from source."
  [source k]
  (let [max-val (val (apply max-key val (frequencies (partition k 1 source))))]
    (map first (filter #(= (val %) max-val)
                       (frequencies (map (partial apply str) (partition k 1 source)))))))


(defn find-clumps [genome k L t]
  (for [k-mer (get-frequent-kmer genome k)]
    (let [indices (get-indices genome k-mer)]
      (if (some true? (map #(<= (+ k (- (last %) (first %))) L)
                           (partition t 1 indices)))
        k-mer))))

2 answers:

Answer 0 (score: 1)

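Apart from some code style points that could be improved, the main problem I see is that you filter the k-mers with max-key val and do not take t into account at all in the initial phase.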

When you look for the most frequent k-mers, you keep only the ones with the highest count:

  (apply max-key val (frequencies (partition k 1 source)))

Because you filter by max-val:

  (filter #(= (val %) max-val)

you only analyze those:

  (for [k-mer (get-frequent-kmer genome k)]

The problem is that if t is 4 but some 5-mers repeat more than 4 times, you will miss the ones that repeat exactly 4 times.
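
One possible way to address this is to keep every k-mer whose total count is at least t, rather than only those tied for the maximum. The following is a minimal sketch of that idea; kmers-with-count-at-least is a hypothetical helper name, not a function from the question:

```clojure
(defn kmers-with-count-at-least
  "Hypothetical helper: returns every k-mer of source that occurs
   at least t times (overlapping occurrences are counted)."
  [source k t]
  (->> (partition k 1 source)      ; sliding windows of k characters
       (map #(apply str %))        ; turn each window back into a string
       frequencies                 ; k-mer -> number of occurrences
       (filter #(>= (val %) t))    ; keep counts >= t, not only the maximum
       (map key)))

;; In "ACACACGTGT", AC occurs 3 times and CA and GT occur twice,
;; so with t = 2 all three are kept (max-only filtering keeps just AC).
(set (kmers-with-count-at-least "ACACACGTGT" 2 2))
;; => #{"AC" "CA" "GT"}
```

Swapping something like this in for get-frequent-kmer would leave the window check in find-clumps unchanged.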

Answer 1 (score: 0)

Here is some code that works:

(defn k-mers 
  "Returns a seq of all k-mers in text."
  [k text]
  (map #(apply str %) (partition k 1 text)))

(defn most-frequent-k-mers 
  "Returns a seq of k-mers in text appearing at least t times."
  [k t text]
  (->> (k-mers k text)
       (frequencies)
       (filter #(<= t (second %)))
       (map first)))

(require 'clojure.set) ; needed for clojure.set/union below

(defn find-clump
  "Finds k-mers forming (L, t) clumps in text."
  [k L t text]
  (let [windows (partition L 1 text)]   ; every window of length L
    (->> windows
         (map #(most-frequent-k-mers k t %))
         (map set)
         (apply clojure.set/union))))

I think this should get you started.
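
For comparison, the index-based approach from the question can also be written so that no k-mer is filtered out up front. This is only a sketch under stated assumptions: it assumes Clojure 1.7 or later (for update), and kmer-positions and clumps-by-positions are hypothetical names that do not appear in the question or in either answer.

```clojure
(defn kmer-positions
  "Hypothetical helper: maps each k-mer of genome to the vector of its
   start indices in ascending order (overlaps included)."
  [genome k]
  (reduce (fn [acc i]
            (update acc (subs genome i (+ i k)) (fnil conj []) i))
          {}
          (range (inc (- (count genome) k)))))

(defn clumps-by-positions
  "Returns the k-mers for which some t consecutive occurrences fit
   inside a window of length L, i.e. (last - first) + k <= L."
  [genome k L t]
  (for [[k-mer idxs] (kmer-positions genome k)
        :when (some (fn [group]
                      (<= (+ k (- (last group) (first group))) L))
                    (partition t 1 idxs))]
    k-mer))

;; AC starts at 0, 2 and 4, so its three occurrences span 4 - 0 + 2 = 6 characters.
(clumps-by-positions "ACACACGTGT" 2 6 3)
;; => ("AC")
```

Because only start indices are compared, this avoids re-counting k-mers in every window, at the cost of holding all positions in memory.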