Implementing a simple scanner/tokenizer with regular expressions in Clojure

Time: 2019-06-12 22:31:18

Tags: java regex clojure match token

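I am trying to write a function named scan-for that takes a collection of strings (the "tokens") as input and returns a "tokenizer" function. The tokenizer takes a string as input and returns a (preferably lazy) sequence of strings consisting of the "tokens" contained in the input (recognized greedily), the non-empty substrings between them, and the non-empty substrings at the start and end of the input, all in the order in which they appear in the input.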

For example, ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities") should produce:

("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")

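In my first attempt, I use a regex to match the "tokens" (with re-seq) and to find the substrings in between (with split), and then interleave the resulting sequences. The problem is that the input string is parsed twice with the constructed regex, and that, because of split, the resulting sequence is not lazy.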

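[In the definition of scan-for I use the tacit/point-free style (avoiding lambdas and their sugared disguises), which I find elegant and useful in general (John Backus would probably agree). In Clojure this demands extended use of partial, to take care of the uncurried functions. If you don't like it, you can add lambdas, threading macros, etc.]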
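
As a small illustration of the style, here is the same function written point-free and with a lambda. The names join-upper-tacit and join-upper-lambda are made up for this example and are not used anywhere else.

```clojure
(require '[clojure.string :as str])

;; Point-free: composed from existing functions, no parameter is named.
(def join-upper-tacit
  (comp (partial str/join "|")
        (partial map str/upper-case)))

;; The same function written with a lambda (illustrative name).
(def join-upper-lambda
  #(str/join "|" (map str/upper-case %)))
```

Both return "AN|BAN" for ["an" "ban"]; the only difference is whether the argument is ever named.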

(defn rpartial
  "a 'right' version of clojure.core/partial"
  [f & args] #(apply f (concat %& args)))

(defn interleave*
  "a 'continuing' version of clojure.core/interleave"
  [& seqs]
  (lazy-seq
    (when-let [seqs (seq (remove empty? seqs))]
      (concat
        (map first seqs)
        (apply interleave* (map rest seqs))))))

(defn make-fn
  "makes a function from a symbol and an (optional) arity"
  ([sym arity]
   (let [args (repeatedly arity gensym)]
     (eval (list `fn (vec args) (cons sym args)))))
  ([sym] (make-fn sym 1)))

(def scan-for
  (comp
    (partial comp
      (partial remove empty?)
      (partial apply interleave*))
    (partial apply juxt)
    (juxt
      (partial rpartial clojure.string/split)
      (partial partial re-seq))
    re-pattern
    (partial clojure.string/join \|)
    (partial map (make-fn 'java.util.regex.Pattern/quote))
    (partial sort (comp not neg? compare))))
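
For comparison, here is roughly the same first attempt written with explicit lambdas and a threading macro, which makes the two passes over the input visible. It is only a sketch: tokens->pattern and scan-for-explicit are names introduced for this illustration, the tokens are assumed to be non-empty strings, and split is called with a limit of -1 (unlike the code above) so that trailing empty strings are kept and the two sequences line up.

```clojure
(require '[clojure.string :as str])

(defn tokens->pattern
  "Hypothetical helper: quotes the tokens and joins them into one alternation,
  sorted in descending order so that \"banal\" is tried before \"ban\"."
  [tokens]
  (->> tokens
       (sort #(compare %2 %1))
       (map #(java.util.regex.Pattern/quote %))
       (str/join "|")
       re-pattern))

(defn scan-for-explicit
  "Sketch of the first attempt with explicit lambdas (illustrative name)."
  [tokens]
  (let [re (tokens->pattern tokens)]
    (fn [s]
      (let [gaps    (str/split s re -1)   ; pass 1 (eager): text between matches
            matches (re-seq re s)]        ; pass 2: the matches themselves
        ;; With limit -1 and non-empty tokens there is exactly one more gap
        ;; than there are matches, so pad the matches with a single nil.
        (remove empty? (interleave gaps (concat matches [nil])))))))
```

Calling ((scan-for-explicit ["an" "ban" "banal" "d"]) "ban bananas and banalities") should give the sequence shown at the top of the question.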

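In my second attempt, I use a regex to match the "tokens" and the single characters in between, and then group those single characters. Here I dislike the amount of processing done outside of the regex matching.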

(defn scan-for [tokens]
  (comp
    (partial remove empty?)
    (fn group [s]
      (lazy-seq
        (if-let [[sf & sr] s]
          (if (or (get sf 1)
                  (some (partial = sf) tokens))
            (list* "" sf (group sr))
            (let [[gf & gr] (group sr)]
              (cons (str sf gf) gr)))
          (cons "" nil))))
    (->> tokens
         (sort (comp not neg? compare))
         (map #(java.util.regex.Pattern/quote %))
         (clojure.string/join \|)
         (#(str % "|(?s)."))
         (re-pattern)
         (partial re-seq))))
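
To show what the grouping function has to work with, here is what re-seq returns for a pattern of this shape on a small input. The pattern and the input are chosen only for illustration.

```clojure
;; Tokens come back whole; everything else arrives one character at a time,
;; which is why the single characters have to be regrouped afterwards.
(def pieces (re-seq #"ban|an|(?s)." "ban as"))
;; pieces => ("ban" " " "a" "s")
```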

So, is there any way to do this by parsing the input just once with some suitable regex, and with minimal processing outside of that parsing?

(A lazy version of split that also returns the regex matches would be helpful, if one exists.)
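
To make the wish concrete, such a function might look roughly like the sketch below, which drives a java.util.regex.Matcher directly. split-with-matches is a hypothetical name, not an existing function in clojure.string, and the sketch assumes the pattern never matches the empty string.

```clojure
(defn split-with-matches
  "Hypothetical helper: lazily yields the non-empty substrings between
  matches of re in s, interleaved with the matches themselves, scanning s
  only once. Assumes re never matches the empty string."
  [re s]
  (let [m (re-matcher re s)]            ; one stateful matcher, advanced on demand
    ((fn step [start]
       (lazy-seq
         (if (.find m)
           (let [gap  (subs s start (.start m))          ; text before this match
                 more (cons (.group m) (step (.end m)))] ; match, then the rest
             (if (empty? gap) more (cons gap more)))
           ;; No further match: emit the remaining text, if any.
           (let [tail (subs s start)]
             (when (seq tail) (list tail))))))
     0)))
```

With the alternation already ordered longest-first, (split-with-matches #"banal|ban|an|d" "ban bananas and banalities") should yield the expected sequence, and the matcher only advances as elements are consumed.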

2 Answers:

Answer 0: (score: 0)

Here is a quick version that is not lazy, but I think it could be turned into a lazy one by using lazy-seq in a few places, as you did in your versions:

(defn scan-for
  ([tokens text unmatched xs]
   (if (empty? text)
     (concat xs (when-not (empty? unmatched) [unmatched]))
     (let [matching (filter #(clojure.string/starts-with? text %) tokens)]
       (if (empty? matching)
         (recur tokens (subs text 1) (str unmatched (subs text 0 1)) xs)
         (let [matched (first matching)]
           (recur tokens
                  (subs text (count matched))
                  ""
                  (concat xs
                          (when-not (empty? unmatched) [unmatched])
                          [matched])))))))
  ([tokens text]
   (scan-for tokens text "" [])))


;; (scan-for ["an" "ban" "banal" "d"] "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")

Edit:

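This was quite an interesting one to play with, so I had to give it a try. I found that clojure.string/split also takes an optional parameter that limits the number of splits it will produce. Assuming that reaching the limit means the rest of the input is not scanned, it can be implemented along the lines of your original suggestion: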

(defn create-regex [xs]
  (->> xs (interpose "|") (apply str) re-pattern))

(defn split-lazy [s re]
  (when-not (empty? s)
    (let [[part remaining] (clojure.string/split s re 2)]
      (lazy-seq (cons part (split-lazy remaining re))))))

(defn scan-lazy [xs s]
  (let [re         (create-regex xs)
        no-matches (split-lazy s re)
        matches    (concat (re-seq re s) (repeat nil))]
    (remove empty?
            (interleave no-matches matches))))

(defn scan-for [xs] (partial scan-lazy xs))

;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")

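In the code above, I use a trick of padding matches with nil so that interleave can consume both collections; otherwise it would stop as soon as either one of them ends.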
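
The two building blocks can be tried in isolation. The inputs below are made up for the example:

```clojure
(require '[clojure.string :as str])

;; A limit of 2 yields at most two parts: the text before the first match
;; and the unsplit remainder.
(def parts (str/split "ban bananas" #"an" 2))
;; parts => ["b" " bananas"]

;; interleave stops as soon as its shortest argument runs out ...
(def truncated (interleave ["a" "b" "c"] ["x" "y"]))
;; truncated => ("a" "x" "b" "y")

;; ... so padding the shorter one with nils keeps the last element.
(def padded (interleave ["a" "b" "c"] (concat ["x" "y"] (repeat nil))))
;; padded => ("a" "x" "b" "y" "c" nil)
```

The trailing nil is harmless here because (empty? nil) is true, so remove empty? drops it together with the empty strings.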

You can also check that it is lazy:

bananas.core> (def bananas ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities"))
#'bananas.core/bananas
bananas.core> (realized? bananas)
false
bananas.core> bananas
("ban" " " "ban" "an" "as " "an" "d" " " "ban" "alities")
bananas.core> (realized? bananas)
true

Edit 2:

If you sort the tokens by decreasing length, you get the "greedy" version you expected:

(defn create-regex [xs]
  (->> xs (sort-by count) reverse (interpose "|") (apply str) re-pattern))

;; ((scan-for ["an" "ban" "banal" "d"]) "ban bananas and banalities")
;; => ("ban" " " "ban" "an" "as " "an" "d" " " "banal" "ities")
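
The reason the ordering matters can be seen on a two-token pattern (patterns and input chosen only for illustration):

```clojure
;; Java regex alternation tries the alternatives left to right and keeps the
;; first one that matches at a position, not the longest one.
(def shortest-first (re-find #"ban|banal" "banalities"))
;; shortest-first => "ban"

(def longest-first (re-find #"banal|ban" "banalities"))
;; longest-first => "banal"
```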

Answer 1: (score: 0)

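I came up with a solution that looks acceptable. It is based on the use of capturing groups and of the *? quantifier (the reluctant/non-greedy version of *).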

Here it is:

(defn scan-for [tokens]
  (comp
    (partial remove empty?)
    flatten
    (partial map rest)
    (->> tokens
         (sort (comp not neg? compare)) ;alternatively, we can sort by decreasing length
         (map #(java.util.regex.Pattern/quote %))
         (clojure.string/join \|)
         (#(str "((?s).*?)(" % "|\\z)"))
         (re-pattern)
         (partial re-seq))))
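
To see why map rest, flatten and remove empty? are needed, here is what re-seq returns for a pattern of this shape on a small input (a single token "an" and a made-up input string, for illustration only):

```clojure
;; Each result is [whole-match group-1 group-2]: group 1 is the reluctantly
;; matched text before a token, group 2 is the token (or "" at end of input).
(def raw (re-seq #"((?s).*?)(an|\z)" "can do"))
;; raw => (["can" "c" "an"] [" do" " do" ""] ["" "" ""])

;; Dropping the whole match, flattening and removing empty strings leaves
;; the pieces in input order.
(def pieces (remove empty? (flatten (map rest raw))))
;; pieces => ("c" "an" " do")
```

The last, all-empty result comes from \z matching once more at the very end of the input; remove empty? discards it.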

In tacit style:

(def scan-for
  (comp
    (partial comp
      (partial remove empty?)
      flatten
      (partial map rest))
    (partial partial re-seq)
    re-pattern
    (partial str "((?s).*?)(")
    (rpartial str "|\\z)")
    (partial clojure.string/join \|)
    (partial map (make-fn 'java.util.regex.Pattern/quote))
    (partial sort (comp not neg? compare))))