Question

Suppose you have a Clojure source file. The file itself might look like this:

(ns foo
  "We've got some sort of docstring here. \"this\" would be an example of
  some sort of escaped text within that docstring.")

(defn bar
  "Another docstring down here."
  [x]
  true)

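Now, let's say I want to capture the contents of one or both of the docstrings here.

The problem is that if I slurp the file into a Clojure REPL, everything gets double-escaped. So it looks like this: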

(ns foo\n\"We've got some sort of docstring here. \\\"this\\\" would be an example of\nsome sort of escaped text within that docstring.\")\n\n(defn bar\n\"Another docstring down here.\"\n[x]\ntrue)
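It may be worth noting that the doubled backslashes are probably just how the REPL prints a string value (in readable form, as `pr` does), not extra characters added by slurp. A minimal sketch, using a made-up string `s` rather than the file above, that compares the actual contents with the printed form:

```clojure
;; `s` is a hypothetical example value, not part of the original question.
(def s "say \"hi\"\n")   ; the value is: say "hi" followed by a newline

(count s)      ; 9 characters; the escaping backslashes are not in the value
(pr-str s)     ; the readable form the REPL echoes, with \" and \n escapes
(println s)    ; prints the raw characters, without the escapes
```

If that is what is happening here, a regex applied to the slurped string only has to cope with the single level of escaping that is actually in the file.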

The regex I've been using so far is as follows:

(re-find #"\"(\\.|[^\"])*\"" source-string)

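This seems reasonable, in that it passes every trivial test case I can come up with. However, it doesn't take a particularly large corpus to make it hit a StackOverflowError.

So, repository of great wizards, I turn to you. Should I be using a different regex? Is a regex simply the wrong answer here? If so, what is the right one?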

Answer 1

You could use something like the following, based on clojure.edn/read:

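(require 'clojure.edn 'clojure.string)

;; Lazily read every top-level form from a java.io.PushbackReader.
(defn expr-seq [in]
  (let [r (.read in)]            ; peek one character to detect end of input
    (if (= -1 r)
      nil
      (do
        (.unread in r)           ; push the peeked character back
        (cons (clojure.edn/read in) (lazy-seq (expr-seq in)))))))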

(defn doc-string [[_ _ ds]]
  (when (string? ds) ds))

(def sexps
  (with-open [in (-> (slurp "/path/to/file.clj")
                     clojure.string/trim
                     java.io.StringReader.
                     java.io.PushbackReader.)]
    (doall (expr-seq in))))
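One caveat with the reading step above: clojure.edn/read only understands the EDN subset, so source that uses reader macros such as regex literals (`#"..."`) or anonymous functions (`#(...)`) may fail to read. As a hedged alternative sketch, the standard reader (`clojure.core/read`) can be used with `*read-eval*` bound to false. The names `read-forms` and `sample` below are hypothetical, and the input is an inline string rather than a file so the example is self-contained:

```clojure
(import '(java.io PushbackReader StringReader))

(defn read-forms
  "Reads every top-level form from the string `src` with the standard
  Clojure reader. Nothing is evaluated, and #= forms are rejected because
  *read-eval* is bound to false."
  [src]
  (binding [*read-eval* false]
    (let [in  (PushbackReader. (StringReader. src))
          eof (Object.)]                      ; unique end-of-input sentinel
      (loop [forms []]
        (let [form (read in false eof)]       ; return `eof` instead of throwing
          (if (identical? form eof)
            forms
            (recur (conj forms form))))))))

;; Hypothetical sample source containing a regex literal, which EDN rejects.
(def sample
  "(ns foo\n  \"A \\\"quoted\\\" doc.\")\n\n(defn bar\n  \"Another docstring.\"\n  [x]\n  (re-find #\"a+\" x))")

;; Third element of each form, kept only when it is a string.
(keep (fn [[_ _ ds]] (when (string? ds) ds))
      (read-forms sample))
```

Passing an end-of-input sentinel to `read` also removes the need to peek and unread a character, and to trim trailing whitespace, before each form.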

; docstrings 
(map doc-string sexps)

=> ("We've got some sort of docstring here. \"this\" would be an example of\n  some sort of escaped text within that docstring." "Another docstring down here.")

; all strings
(filter string? (tree-seq coll? seq sexps))
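If a regex is still preferred, the StackOverflowError most likely comes from the repeated capturing group with alternation, `(\\.|[^\"])*`, which java.util.regex tends to match recursively, roughly one level per character. A commonly used workaround, sketched here as an assumption rather than a guaranteed fix, is to "unroll" the alternation into character-class runs and use possessive quantifiers. The names `string-literal-re`, `string-literals` and `long-doc` are hypothetical:

```clojure
;; A quote, a run of ordinary characters, then zero or more
;; (escape sequence + run of ordinary characters), then the closing quote.
;; The possessive quantifiers (*+) keep the engine from backtracking into a run.
(def string-literal-re
  #"\"[^\"\\]*+(?:\\.[^\"\\]*+)*+\"")

(defn string-literals
  "Returns every double-quoted literal found in `source`, quotes included."
  [source]
  (re-seq string-literal-re source))

;; A deliberately long docstring to exercise the pattern.
(def long-doc
  (str "(ns foo \"" (apply str (repeat 200000 \a)) " \\\"this\\\" end\")"))

(string-literals "(defn bar \"Another docstring down here.\" [x] true)")
(count (first (string-literals long-doc)))
```

Unlike the reader-based approach, this matches every string literal in the text, including ones that are not docstrings or that sit inside comments, and how deep the matcher recurses ultimately depends on the JVM's regex implementation.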

Java 6 / Clojure regex to capture the contents inside quotes, except for double-escaped quotes... and without StackOverflowErrors
