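I am trying to find clumps in Clojure. Basically, I need to find all k-length substrings that occur t times within a window of size L in the genome. I have implemented what I think is the solution, but I believe it has a bug, because the system I use to check it (beta.stepic.org) tells me so. Can you find what I messed up?
My solution is as follows: find all the top-ranking k-mers (k-length substrings) and find their starting indices. After that, I partition the indices into groups of t, since t is the number of times a k-mer must occur, and check that the difference between the last and first index in each group, plus k, is at most L (all the k-mers have to fit in the L-window, and adding k accounts for the length of the last k-mer). The indices are in ascending order. Where is the bug?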
Input: A string Genome, and integers k, L, and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome.
Sample Input:
Genome: CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA
k: 5
L: 50
t: 4
Sample Output:
CGACA GAAGA
(defn get-indices [source target]
  "Returns the indices for the substring target
  found in source in ascending order. This includes overlaps."
  (let [search (java.util.regex.Pattern/compile (str "(?=(" target "))"))
        matcher (re-matcher search source)
        not-nil? (complement nil?)]
    (defn inner [matcher]
      (if (not-nil? (re-find matcher))
        (cons (.start matcher) (inner matcher))))
    (inner matcher)))
(defn get-frequent-kmer [source k]
  "Gets the most frequent k-mers of size k from source"
  (let [max-val (val (apply max-key val (frequencies (partition k 1 source))))]
    (map first (filter #(= (val %) max-val)
                       (frequencies (map (partial apply str) (partition k 1 source)))))))
(defn find-clumps [genome k L t]
  (for [k-mer (get-frequent-kmer genome k)]
    (let [indices (get-indices genome k-mer)]
      (if (some true? (map #(<= (+ k (- (last %) (first %))) L)
                           (partition t 1 indices)))
        k-mer))))
Answer 0 (score: 1)
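Apart from some code-style issues that could be improved, the main problem I see is that you filter the k-mers on max-key val and do not take t into account at all in that initial filtering.
When you look for the most frequent k-mers, you keep only the ones with the highest count:
(apply max-key val (frequencies (partition k 1 source)))
Because you then filter by max-val
(filter #(= (val %) max-val)
those are the only k-mers you analyze:
(for [k-mer (get-frequent-kmer genome k)]
The problem is that if t is 4 but some 5-mers occur more than 4 times, you will miss the 5-mers that occur exactly 4 times.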
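One way to apply this fix, offered as a minimal sketch rather than a definitive implementation: consider every k-mer, not only those with the maximum count, and run the same index-span check from the question on each of them. The names k-mer-indices, clump? and find-clumps-fixed are hypothetical, and the sketch assumes the genome is at least L characters long.

```clojure
;; Hypothetical helper: maps each k-mer to a vector of its start indices.
;; Indices are visited left to right, so each vector is in ascending order.
(defn k-mer-indices
  [genome k]
  (reduce (fn [acc i]
            (update acc (subs genome i (+ i k)) (fnil conj []) i))
          {}
          (range (inc (- (count genome) k)))))

;; True if some t consecutive occurrences fit inside a window of length L.
;; A k-mer with fewer than t occurrences yields no groups, so it is rejected.
(defn clump?
  [indices k L t]
  (boolean
    (some (fn [group] (<= (+ k (- (last group) (first group))) L))
          (partition t 1 indices))))

;; Checks every k-mer, instead of only those with the maximum count.
(defn find-clumps-fixed
  [genome k L t]
  (for [[k-mer indices] (k-mer-indices genome k)
        :when (clump? indices k L t)]
    k-mer))

;; AC occurs five times and CA four times; both form (10, 4)-clumps.
(set (find-clumps-fixed "ACACACACAC" 2 10 4))
;; => #{"AC" "CA"} (set order may vary)
```

In this small example, filtering on the maximum count would keep only AC (five occurrences) and drop CA (four occurrences), which is exactly the case described above.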
Answer 1 (score: 0)
Here is some code that works:
(require 'clojure.set) ; needed for clojure.set/union below
(defn k-mers
  "Returns a seq of all k-mers in text."
  [k text]
  (map #(apply str %) (partition k 1 text)))
(defn most-frequent-k-mers
  "Returns a seq of k-mers in text appearing at least t times."
  [k t text]
  (->> (k-mers k text)
       (frequencies)
       (filter #(<= t (second %)))
       (map first)))
(defn find-clump
  "Finds k-mers forming (L, t) clumps in text."
  [k L t text]
  (let [windows (partition L 1 text)]
    (->> windows
         (map #(most-frequent-k-mers k t %))
         (map set)
         (apply clojure.set/union))))
I think you should start from here.
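To check this window-by-window idea against the sample input from the question, a self-contained sketch along the following lines might help. It restates the same approach with string windows and collects the results with mapcat instead of clojure.set/union; the names window-k-mers, clumps-by-window and sample-genome are hypothetical.

```clojure
;; Hypothetical helper: the k-mers that occur at least t times in one window.
(defn window-k-mers
  [k t window]
  (->> (partition k 1 window)
       (map #(apply str %))
       frequencies
       (filter (fn [[_ n]] (>= n t)))
       (map first)))

;; Slides a window of length L over text and collects, as a set, every
;; k-mer that occurs at least t times inside at least one window.
(defn clumps-by-window
  [k L t text]
  (->> (range (inc (- (count text) L)))
       (map #(subs text % (+ % L)))
       (mapcat #(window-k-mers k t %))
       set))

;; Sample input from the question.
(def sample-genome
  "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA")

(clumps-by-window 5 50 4 sample-genome)
;; => #{"CGACA" "GAAGA"} (set order may vary)
```

This recounts the k-mers for every window, so it does roughly L units of work per window. That is fine for a sample of this size, but it may be slow on a full genome.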