Java 6 / Clojure regex to capture the contents within quotes, except escaped quotes... and without StackOverflowErrors

Asked: 2015-08-07 21:27:57

Tags: java regex clojure

Suppose you have a Clojure source file. The file itself might look like this:

(ns foo
  "We've got some sort of docstring here. \"this\" would be an example of
  some sort of escaped text within that docstring.")

(defn bar
  "Another docstring down here."
  [x]
  true)

Now suppose I want to capture the contents of one or both of the docstrings here.

The trouble is that if I slurp it into the Clojure REPL, everything shows up double-escaped. So it looks like this:

(ns foo\n\"We've got some sort of docstring here. \\\"this\\\" would be an example of\nsome sort of escaped text within that docstring.\")\n\n(defn bar\n\"Another docstring down here.\"\n[x]\ntrue)
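A small sketch of where the extra escaping comes from, assuming a string literal stands in for the slurped file contents (the name `file-text` is hypothetical): the REPL prints strings in readable form, so every backslash and quote already in the text gets one more backslash on display, while the string itself is unchanged.

```clojure
;; Hypothetical stand-in for slurped text whose file contents are: "a \"b\""
(def file-text "\"a \\\"b\\\"\"")

;; The text really does contain backslashes, one before each inner quote.
(count (filter #{\\} file-text)) ; => 2

;; println shows the raw characters; pr-str shows the REPL's readable form,
;; which adds a further level of escaping for display only.
(println file-text)
(println (pr-str file-text))

;; Reading the printed form back yields the original string.
(= file-text (read-string (pr-str file-text))) ; => true
```

So a pattern applied to the slurped string only has to cope with the single level of escaping that is present in the file.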

The regex I have been using so far is the following:

(re-find #"\"(\\.|[^\"])*\"" source-string)

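This is reasonable in that it passes all the trivial test cases I could come up with. However, it does not take a particularly large corpus to make it hit a StackOverflowError.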

So, great repository of wizards, I turn to you. Should I be using a different regex? Is a regex simply the wrong tool here? If so, what is the right one?
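For reference, one commonly suggested rewrite is an "unrolled" pattern, sketched below. It rests on an assumption about `java.util.regex`: a repeated group containing an alternation, such as `(\\.|[^\"])*`, appears to be matched recursively, one stack frame per character, whereas a plain character class with `*` is matched in a loop. With the unrolled form the recursion should grow with the number of escape sequences rather than the length of the string, which lowers the risk without removing it. The names `string-literal-re`, `string-literals` and `sample-source` are hypothetical.

```clojure
;; Unrolled form: runs of ordinary characters, optionally interleaved with
;; backslash escapes. (?s) lets "." match a newline after a backslash.
(def string-literal-re
  #"(?s)\"([^\"\\]*(?:\\.[^\"\\]*)*)\"")

(defn string-literals
  "Returns the raw contents (escape sequences left intact) of every
  double-quoted literal found in source."
  [source]
  (map second (re-seq string-literal-re source)))

;; Hypothetical sample standing in for a slurped source file.
(def sample-source
  "(ns foo\n  \"Docstring with \\\"quoted\\\" text.\")\n\n(defn bar\n  \"Another docstring.\"\n  [x]\n  true)")

(string-literals sample-source)
;; => ("Docstring with \\\"quoted\\\" text." "Another docstring.")
```

A pattern like this still knows nothing about comments or character literals such as `\"`, so it can mis-pair quotes on real source files.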

1 Answer:

Answer 0: (score: 0)

You can use something like the following, based on clojure.edn/read:

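(require 'clojure.edn 'clojure.string)

;; Lazily reads top-level forms until the reader is exhausted.
(defn expr-seq [in]
  (let [r (.read in)]
    (if (= -1 r)
      nil
      (do
        (.unread in r) ; push the peeked character back before reading a form
        (cons (clojure.edn/read in) (lazy-seq (expr-seq in)))))))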

(defn doc-string [[_ _ ds]]
  (when (string? ds) ds))

(def sexps
  (with-open [in (-> (slurp "/path/to/file.clj")
                     clojure.string/trim
                     java.io.StringReader.
                     java.io.PushbackReader.)]
    (doall (expr-seq in))))

; docstrings 
(map doc-string sexps)

=> ("We've got some sort of docstring here. \"this\" would be an example of\n  some sort of escaped text within that docstring." "Another docstring down here.")

; all strings
(filter string? (tree-seq coll? seq sexps))
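One caveat: `clojure.edn/read` only accepts EDN, and as far as I know constructs such as `#(...)` function literals or regex literals are not part of EDN, so it may throw on ordinary Clojure source. Below is a minimal sketch of a variant that uses `clojure.core/read` instead, with `*read-eval*` bound to `false` so that `#=` forms are not evaluated. The names `read-forms`, `all-strings` and `sample-with-fn` are hypothetical.

```clojure
(import '(java.io PushbackReader StringReader))

(defn read-forms
  "Reads every top-level form from a source string with the Clojure reader.
  A unique sentinel object marks end of input, so no trimming is needed."
  [source]
  (binding [*read-eval* false]
    (with-open [in (PushbackReader. (StringReader. source))]
      (let [eof (Object.)]
        ;; doall forces all reads while the reader is open and the binding holds
        (doall (take-while #(not (identical? eof %))
                           (repeatedly #(read in false eof))))))))

(defn all-strings
  "Returns every string found anywhere inside the given forms."
  [forms]
  (filter string? (tree-seq coll? seq forms)))

;; Hypothetical sample containing a #(...) literal, which is not EDN.
(def sample-with-fn
  "(ns foo\n  \"Doc one.\")\n\n(defn bar\n  \"Doc two.\"\n  [x]\n  (map #(str % \"!\") x))")

(all-strings (read-forms sample-with-fn))
;; => ("Doc one." "Doc two." "!")
```

Because the reader has already processed the escapes, the strings come back unescaped, which the regex approach does not give you.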