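I am trying to write a function called scan-for that takes a collection of strings ("tokens") as input and returns a "tokenizer" function. The tokenizer takes a string as input and returns a (preferably lazy) sequence of strings consisting of the "tokens" contained in the input (recognized greedily), together with the non-empty substrings between them and at the start and end of the input, all in the order in which they appear in the input.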
For example,
((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
should produce:
("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")
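In my first attempt, I use a regular expression to match the "tokens" (with re-seq) and to find the substrings in between (with split), and then interleave the resulting sequences. The problem is that the input string is parsed twice with the constructed regular expression, and that the resulting sequence is not lazy, because of split.
[In the definition of scan-for I use tacit/point-free style (avoiding lambdas and their sugared disguises), which I find elegant and useful in general (John Backus would probably agree). In Clojure, this requires extended use of partial to take care of the uncurried functions. If you don't like it, you can add lambdas, threading macros, etc.]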
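To make the style concrete, here is a small sketch that writes the same helper once with an explicit lambda and once point-free with partial and comp. The names join-bar-lambda, join-bar and sorted-join are made up for this illustration and do not appear in the attempts below.

```clojure
(require '[clojure.string :as str])

;; With an explicit lambda:
(def join-bar-lambda (fn [xs] (str/join \| xs)))

;; Point-free: partial fixes the leading argument, no parameter is named.
(def join-bar (partial str/join \|))

;; comp chains the steps right to left: sort descending, then join.
(def sorted-join
  (comp (partial str/join \|)
        (partial sort (comp not neg? compare))))

(sorted-join ["an" "ban" "banal" "d"])
;; => "d|banal|ban|an"
```

Sorting in descending order puts "banal" before "ban", which is what lets the regex alternation prefer the longer token.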
(defn rpartial
  "a 'right' version of clojure.core/partial"
  [f & args] #(apply f (concat %& args)))

(defn interleave*
  "a 'continuing' version of clojure.core/interleave"
  [& seqs]
  (lazy-seq
    (when-let [seqs (seq (remove empty? seqs))]
      (concat
        (map first seqs)
        (apply interleave* (map rest seqs))))))

(defn make-fn
  "makes a function from a symbol and an (optional) arity"
  ([sym arity]
   (let [args (repeatedly arity gensym)]
     (eval (list `fn (vec args) (cons sym args)))))
  ([sym] (make-fn sym 1)))

(def scan-for
  (comp
    (partial comp
      (partial remove empty?)
      (partial apply interleave*))
    (partial apply juxt)
    (juxt
      (partial rpartial clojure.string/split)
      (partial partial re-seq))
    re-pattern
    (partial clojure.string/join \|)
    (partial map (make-fn 'java.util.regex.Pattern/quote))
    (partial sort (comp not neg? compare))))
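To show what is being interleaved, the sketch below runs the two halves separately on the example input. The pattern is written out by hand (without Pattern/quote) and the names example-re and example-text are assumptions made for the illustration.

```clojure
(require '[clojure.string :as str])

;; Hand-written equivalent of the constructed pattern, longest tokens first.
(def example-re #"banal|ban|an|d")
(def example-text "ban bananas and banalities")

;; The tokens, found by one pass over the input:
(re-seq example-re example-text)
;; => ("ban" "ban" "an" "an" "d" "banal")

;; The gaps, found by a second pass over the same input:
(str/split example-text example-re)
;; => ["" " " "" "as " "" " " "ities"]
```

Interleaving the gaps with the tokens and dropping the empty strings gives the expected result, but each of the two calls scans the whole input.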
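In my second attempt, I use a regular expression that matches the "tokens" and the single characters in between, and then group those single characters. Here I don't like the amount of processing done outside the regex matching.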
(defn scan-for [tokens]
  (comp
    (partial remove empty?)
    (fn group [s]
      (lazy-seq
        (if-let [[sf & sr] s]
          (if (or (get sf 1)
                  (some (partial = sf) tokens))
            (list* "" sf (group sr))
            (let [[gf & gr] (group sr)]
              (cons (str sf gf) gr)))
          (cons "" nil))))
    (->> tokens
         (sort (comp not neg? compare))
         (map #(java.util.regex.Pattern/quote %))
         (clojure.string/join \|)
         (#(str % "|(?s)."))
         (re-pattern)
         (partial re-seq))))
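As an illustration of what the grouping function receives, here is the raw re-seq output for a short input, using a hand-written pattern of the same shape (tokens first, then a catch-all single character). The short input "as and" is chosen only to keep the output readable.

```clojure
;; Tokens are tried first; anything else falls through to "(?s)." and is
;; returned one character at a time.
(re-seq #"banal|ban|an|d|(?s)." "as and")
;; => ("a" "s" " " "an" "d")
```

The group function then has to glue "a", "s" and " " back into "as ", which is the extra processing outside the regex.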
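So, is there a way to parse the input only once, using some suitable regular expression, with minimal processing outside that parse?
(A lazy version of split that also returns the regex matches would be helpful, if one exists.)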
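For reference, the kind of helper meant here might look roughly like the following sketch. It drives a java.util.regex.Matcher by hand inside lazy-seq, so the input is scanned once and only as far as the consumer reads. The name split-with-matches is hypothetical, and the sketch assumes the sequence is consumed in order, since the Matcher is stateful.

```clojure
(defn split-with-matches
  "Hypothetical helper: lazily yields the non-empty pieces of s, alternating
  unmatched text and matches of re, with a single Matcher pass over s."
  [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)
        step (fn step [pos]
               (lazy-seq
                 (if (.find m)
                   (let [start (.start m)
                         end   (.end m)
                         ;; the match itself, followed by the lazy rest
                         tail  (cons (.group m) (step end))]
                     (if (< pos start)
                       (cons (subs s pos start) tail) ; gap before the match
                       tail))
                   ;; no more matches: emit the remaining text, if any
                   (when (< pos (count s))
                     (list (subs s pos))))))]
    (remove empty? (step 0))))

(split-with-matches #"banal|ban|an|d" "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")
```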
Answer 0 (score: 0)
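Here is a quick-and-dirty version. It is not lazy, but I think it could be turned into a lazy one by using lazy-seq in a few places, as you did in your versions: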
(defn scan-for
  ([tokens text unmatched xs]
   (if (empty? text)
     ;; emit the trailing unmatched text only when it is non-empty
     (concat xs (when-not (empty? unmatched) [unmatched]))
     (let [matching (filter #(clojure.string/starts-with? text %) tokens)]
       (if (empty? matching)
         (recur tokens (subs text 1) (str unmatched (subs text 0 1)) xs)
         (let [matched (first matching)]
           (recur tokens
                  (subs text (count matched))
                  ""
                  (concat xs
                          (when-not (empty? unmatched) [unmatched])
                          [matched])))))))
  ([tokens text]
   (scan-for tokens text "" [])))
;; (scan-for ["an" "ban" "banal" "d"] "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")
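One way the lazy variant could look is sketched below. It keeps the same prefix-matching idea but wraps each step in lazy-seq instead of accumulating into xs. The name scan-lazily is made up for this sketch, and like the version above it takes the first matching token in the given order rather than the longest one.

```clojure
(require '[clojure.string :as str])

(defn scan-lazily
  "Sketch of a lazy variant: produces pieces only as they are consumed."
  [tokens text]
  (letfn [(step [text unmatched]
            (lazy-seq
              (if (empty? text)
                ;; end of input: flush any pending unmatched text
                (when (seq unmatched) [unmatched])
                (if-let [matched (first (filter #(str/starts-with? text %) tokens))]
                  (let [tail (cons matched
                                   (step (subs text (count matched)) ""))]
                    (if (seq unmatched) (cons unmatched tail) tail))
                  ;; no token starts here: move one character to unmatched
                  (step (subs text 1) (str unmatched (subs text 0 1)))))))]
    (step text "")))

(scan-lazily ["an" "ban" "banal" "d"] "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")
```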
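Edit:
This was a really fun puzzle, so I had to give it another try. I found that clojure.string/split also takes an optional argument that limits the number of splits it produces. Assuming that the rest of the input is not scanned once the limit is reached, you can implement it along the lines of your original proposal: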
(defn create-regex [xs]
  (->> xs (interpose "|") (apply str) re-pattern))

(defn split-lazy [s re]
  (when-not (empty? s)
    (let [[part remaining] (clojure.string/split s re 2)]
      (lazy-seq (cons part (split-lazy remaining re))))))

(defn scan-lazy [xs s]
  (let [re (create-regex xs)
        no-matches (split-lazy s re)
        matches (concat (re-seq re s) (repeat nil))]
    (remove empty?
            (interleave no-matches matches))))

(defn scan-for [xs] (partial scan-lazy xs))
;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")
In the code above I use a trick: I pad matches with nil so that interleave can consume both collections; otherwise it would stop as soon as the shorter one ends.
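The two building blocks can be tried in isolation. The snippet below is only an illustration with made-up inputs: it shows split with a limit of 2 returning the first piece and the untouched remainder, and interleave with and without the nil padding.

```clojure
(require '[clojure.string :as str])

;; A limit of 2 yields the text before the first match and the remainder.
(str/split "ban bananas" #"ban|an" 2)
;; => ["" " bananas"]

;; interleave stops at the shorter collection, so "c" is lost:
(interleave ["a" "b" "c"] ["x" "y"])
;; => ("a" "x" "b" "y")

;; Padding the shorter one with nils keeps the last element:
(remove nil? (interleave ["a" "b" "c"] (concat ["x" "y"] (repeat nil))))
;; => ("a" "x" "b" "y" "c")
```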
You can also check that it is lazy:
bananas.core> (def bananas ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities"))
#'bananas.core/bananas
bananas.core> (realized? bananas)
false
bananas.core> bananas
("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")
bananas.core> (realized? bananas)
true
Edit 2:
If you sort the tokens by decreasing length, you get the "greedy" version you expected:
(defn create-regex [xs]
(->> xs (sort-by count) reverse (interpose "|") (apply str) re-pattern))
;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")
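The ordering matters because Java's regex alternation takes the first alternative that matches, not the longest. The sketch below shows this on a single word and also adds Pattern/quote, which the question's versions use so that tokens containing regex metacharacters are matched literally. The name create-regex-quoted is an assumption made for this example.

```clojure
(require '[clojure.string :as str])

;; Alternation is ordered: the first alternative that matches wins.
(re-seq #"an|ban|banal|d" "banal")   ;; => ("ban")
(re-seq #"banal|ban|an|d" "banal")   ;; => ("banal")

(defn create-regex-quoted
  "Sketch: longest tokens first, each quoted so it is matched literally."
  [xs]
  (->> xs
       (sort-by count >)
       (map #(java.util.regex.Pattern/quote %))
       (str/join "|")
       re-pattern))

(re-seq (create-regex-quoted ["an" "ban" "banal" "d"]) "banal")
;; => ("banal")
(re-seq (create-regex-quoted ["a.b"]) "axb a.b")
;; => ("a.b")
```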
Answer 1 (score: 0)
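I came up with a solution that looks acceptable. It is based on capturing groups and on the *? quantifier (the reluctant/non-greedy version of *).
Here it is: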
(defn scan-for [tokens]
  (comp
    (partial remove empty?)
    flatten
    (partial map rest)
    (->> tokens
         (sort (comp not neg? compare)) ; alternatively, we can sort by decreasing length
         (map #(java.util.regex.Pattern/quote %))
         (clojure.string/join \|)
         (#(str "((?s).*?)(" % "|\\z)"))
         (re-pattern)
         (partial re-seq))))
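To see why the post-processing is just rest, flatten and remove empty?, it may help to look at the raw re-seq output for a short input. The pattern below is a hand-written instance of the constructed one (unquoted tokens, short input chosen for readability).

```clojure
;; Group 1 reluctantly collects the text before a token; group 2 is the
;; token itself, or the empty match of \z at the end of the input.
(re-seq #"((?s).*?)(banal|ban|an|d|\z)" "as and")
;; => (["as an" "as " "an"] ["d" "" "d"] ["" "" ""])

;; Dropping the whole-match entry, flattening and removing empty strings:
(->> (re-seq #"((?s).*?)(banal|ban|an|d|\z)" "as and")
     (map rest)
     flatten
     (remove empty?))
;; => ("as " "an" "d")
```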
In tacit (point-free) style:
(def scan-for
  (comp
    (partial comp
      (partial remove empty?)
      flatten
      (partial map rest))
    (partial partial re-seq)
    re-pattern
    (partial str "((?s).*?)(")
    (rpartial str "|\\z)")
    (partial clojure.string/join \|)
    (partial map (make-fn 'java.util.regex.Pattern/quote))
    (partial sort (comp not neg? compare))))