Clojure re-seq regex: unexpected results

Asked: 2015-03-12 14:43:55

Tags: regex clojure

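I'm having some trouble working out the right regular expression for the following:

I have an input file that I'm trying to group based on keyword expressions. Here is an example of the file (let's call this case 1):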

Foo: B
  "This is instance B of type Foo"
  Bar: X
  etc.

Foo: C
  "This is instance C of type Foo"
  Bar: Y
  etc.

The following regular expression:

#"(?s)(Foo:)(?:(?!Foo:).)*"

works like a charm and produces the result I expect:

(["Foo: B\n  \"This is instance B of type Foo\"\n  Bar: X\n  etc.\n\n"
  "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])

However, if someone adds a colon after 'Foo' in the comment, it gets funky and results in:

(["Foo: B\n  \"This is instance B of type " "Foo:"]
 ["Foo:\"\n  Bar: X\n  etc.\n\n" "Foo:"]
 ["Foo: C\n  \"This is instance C of type Foo\"\n  Bar: Y\n  etc.\n\n\n"
  "Foo:"])

If, as a test, I remove Foo: C and its content from the input and change the regex to:

#"(?s)(Foo:)(?:(?!\"Foo:\").)*"

I get the expected result:

(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\n\n\n"
  "Foo:"])

With Foo: C added back into the mix, however, it no longer respects the boundary and results in:

(["Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\nFoo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n\n\n\n"
  "Foo:"])
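A likely reason for this result: the lookahead (?!\"Foo:\") only blocks where the text literally contains Foo: with a quote on both sides. In the sample, the comment ends in Foo:" and the headers are a bare Foo:, so the lookahead never fires and the dot runs to the end of the input. A small check, assuming the two-group input is bound to the hypothetical name sample:

```clojure
;; `sample` is a hypothetical name for the two-group input with colons in the comments.
(def sample
  (str "Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\n"
       "Foo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n"))

;; The quoted literal "Foo:" (quote on both sides) never occurs in the input,
;; so the negative lookahead can never stop the match.
(def quoted-hit (re-find #"\"Foo:\"" sample))

;; With nothing to stop it, the first match swallows both groups.
(def matches (re-seq #"(?s)(Foo:)(?:(?!\"Foo:\").)*" sample))
```

Here quoted-hit is nil and matches holds a single entry covering both groups.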

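I tried this, to no avail: #"(?s)(Foo:)(?:(?!Foo:|\"Foo:\").)*", to name one of a thousand unsuccessful spins.

I appreciate any help. The intent is to chunk the file using regular expressions.

Current solution: I moved away from regex and just handle the simple chunking I need. The first solution was a loop/recur with some (too many) conditions and a mutated atom as the accumulating map.

I had been itching to do something specific with reduce. It may not be the best application, but I learned from the exercise and removed an excessive number of lines of code.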

(require '[clojure.string :as s])

(def owl-type-map
    {
     "Prefix:"               :prefixes
     "AnnotationProperty:"   :annotation-properties
     "Ontology:"             :ontology
     "Datatype:"             :data-types
     "DataProperty:"         :data-properties
     "ObjectProperty:"       :object-properties
     "Class:"                :classes
     "Individual:"           :individuals
     "EquivalentClasses:"    :miscellaneous
     "DisjointClasses:"      :miscellaneous
     "EquivalentProperties:" :miscellaneous
     "DisjointProperties:"   :miscellaneous
     "SameIndividual:"       :miscellaneous
     "DifferentIndividuals:" :miscellaneous
     })

(def owl-control (reduce #(assoc %1 (second %2) nil) {:current nil} owl-type-map))

(def space-split #(s/split (str %) #" "))

(defn owl-chunk
  "Reducing function that accumulates a series of strings associated with
  particular instaparse EBNF productions (e.g. Class:, Prefix:, Ontology:).
  owl-type-map maps each owl-type (string) to its EBNF production."
  [acc v]
  (let [odex  (:current acc)                 ; bucket currently being filled
        stip  ((comp first space-split) v)   ; first space-delimited token of the line
        index (get owl-type-map stip odex)   ; new bucket, or stay in the current one
        imap  (if (= index odex) acc (assoc-in acc [:current] index))]
    (assoc-in imap [index] (str (get imap index) v "\n"))))

;; Calling, where s is the sequence of input lines

(reduce owl-chunk owl-control s)
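To make the reduce step concrete, here is a minimal self-contained sketch of the same idea. The names demo-type-map, demo-chunk, demo-lines and demo-result are hypothetical stand-ins, and the two-entry map is a cut-down assumption rather than the real owl-type-map:

```clojure
(require '[clojure.string :as string])

;; Hypothetical two-entry stand-in for owl-type-map.
(def demo-type-map
  {"Prefix:" :prefixes
   "Class:"  :classes})

(defn demo-chunk
  "Reducing function: append line v to the bucket of the most recent keyword."
  [acc v]
  (let [current (:current acc)
        ;; Indented lines split into a leading \"\" token, which is not a key,
        ;; so they fall through to the current bucket.
        token   (first (string/split (str v) #" "))
        index   (get demo-type-map token current)
        acc2    (if (= index current) acc (assoc acc :current index))]
    (assoc acc2 index (str (get acc2 index) v "\n"))))

(def demo-lines
  ["Prefix: ex: <http://example.org/>"
   "Class: Foo"
   "  SubClassOf: Bar"
   "Class: Baz"])

(def demo-result
  (reduce demo-chunk {:current nil :prefixes nil :classes nil} demo-lines))
```

Note that every block of the same type is appended to one string, so both Class: entries end up under :classes and would still need to be separated by a later parsing step.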

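1 answer:

Answer 0 (score: 0)

You might want to consider using a parser generator. Mark Engelberg's Instaparse is an excellent Clojure parsing library that aims to make this an easy choice; the first line of its README is "What if context-free grammars were as easy to use as regular expressions?"

Here is an example of how you could use it to parse your sample input: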

;; [instaparse "1.3.5"]
(require '[instaparse.core :as insta])

(def p (insta/parser "

S = Group*
Group = GroupHeader GroupComment GroupBody
GroupHeader = #'[A-Za-z]+' ': ' #'[A-Za-z]+' '\n'
GroupComment = ws? '\"' #'[^\"]+' '\"\n'
GroupBody = Line*
Line = #'.*' '\n'
ws = #'\\s+'

"))

(p "Foo: B
  \"This is instance B of type Foo\"
  Bar: X
Foo: C
  \"This is instance C of type Foo\"
  Bar: Y
")
;;=>
[:S
 [:Group
  [:GroupHeader "Foo" ": " "B" "\n"]
  [:GroupComment [:ws "  "] "\"" "This is instance B of type Foo" "\"\n"]
  [:GroupBody
   [:Line "  Bar: X" "\n"]]]
 [:Group
  [:GroupHeader "Foo" ": " "C" "\n"]
  [:GroupComment [:ws "  "] "\"" "This is instance C of type Foo" "\"\n"]
  [:GroupBody
   [:Line "  Bar: Y" "\n"]]]]

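Adding a colon after "Foo" in the quoted string would not be a problem. (Of course the grammar above is very simple; I imagine you might want to start nested groups at Bar: and so on.)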
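For comparison, if regex-only chunking is still wanted, one possible sketch anchors the keyword to the start of a line with the (?m) flag, so a Foo: inside an indented comment no longer counts as a boundary. This assumes group headers always begin in column 0 and comments never do. The names sample-input and chunks are hypothetical:

```clojure
;; Hypothetical sample: both comments contain "Foo:" with a colon.
(def sample-input
  (str "Foo: B\n  \"This is instance B of type Foo:\"\n  Bar: X\n  etc.\n\n"
       "Foo: C\n  \"This is instance C of type Foo:\"\n  Bar: Y\n  etc.\n"))

;; (?m) makes ^ match at every line start; (?s) lets . cross newlines.
;; The lazy .*? stops just before the next line-initial Foo: or at end of input.
(def chunks
  (re-seq #"(?ms)^Foo:.*?(?=^Foo:|\z)" sample-input))
```

Because the pattern has no capturing groups, re-seq returns plain strings here, one per group. This does not help if a comment line can itself start with the keyword, which is a case a grammar handles more robustly.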