Searching XML in Clojure

Asked: 2012-07-18 09:09:37

Tags: xml clojure

I have the following sample XML:

<data>
  <products>
    <product>
      <section>Red Section</section>
      <images>
        <image>img.jpg</image>
        <image>img2.jpg</image>
      </images>
    </product>
    <product>
      <section>Blue Section</section>
      <images>
        <image>img.jpg</image>
        <image>img3.jpg</image>
      </images>
    </product>
    <product>
      <section>Green Section</section>
      <images>
        <image>img.jpg</image>
        <image>img2.jpg</image>
      </images>
    </product>
  </products>
</data>

I know how to parse it in Clojure:

(require '[clojure.xml :as xml])
(def x (xml/parse "location/of/that/xml"))

This returns a nested map describing the XML:

{:tag :data,
 :attrs nil,
 :content [
     {:tag :products,
      :attrs nil,
      :content [
          {:tag :product,
           :attrs nil,
           :content [] ..

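This structure can of course be traversed with standard Clojure functions, but that can get quite verbose, especially compared with, say, querying it with XPath. Are there any helpers for traversing and searching such a structure? How can I, for example:

  • get a list of all <product> elements
  • get only the products whose <images> tag contains an <image> with the text "img2.jpg"
  • get the products whose section is "Red Section"

Thanks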

5 Answers:

Answer 0 (score: 9)

Using zippers from data.zip, here is a solution for your second use case:

(ns core
  (:use clojure.data.zip.xml)
  (:require [clojure.zip :as zip]
            [clojure.xml :as xml]))

(def data (zip/xml-zip (xml/parse PATH)))
(def products (xml-> data :products :product))

(for [product products :let [image (xml-> product :images :image)]
                       :when (some (text= "img2.jpg") image)]
  {:section (xml1-> product :section text)
   :images (map text image)})
=> ({:section "Red Section", :images ("img.jpg" "img2.jpg")}
    {:section "Green Section", :images ("img.jpg" "img2.jpg")})

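Answer 1 (score: 4)

Here is an alternative version using data.zip that covers all three use cases. I have found that xml-> and xml1-> have pretty powerful navigation built in, with sub-queries written as vectors.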

;; [org.clojure/data.zip "0.1.1"]

(ns example.core
  (:require
   [clojure.zip :as zip]
   [clojure.xml :as xml]
   [clojure.data.zip.xml :refer [text xml-> xml1->]]))

(def data (zip/xml-zip (xml/parse "/tmp/products.xml")))

(let [all-products (xml-> data :products :product)
      red-section (xml1-> data :products :product [:section "Red Section"])
      img2 (xml-> data :products :product [:images [:image "img2.jpg"]])]
  {:all-products (map (fn [product] (xml1-> product :section text)) all-products)
   :red-section (xml1-> red-section :section text)
   :img2 (map (fn [product] (xml1-> product :section text)) img2)})

=> {:all-products ("Red Section" "Blue Section" "Green Section"),
    :red-section "Red Section",
    :img2 ("Red Section" "Green Section")}

Answer 2 (score: 3)

You can use a library such as clj-xpath.
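
To give a feel for what XPath queries look like for the three use cases, here is a minimal sketch. It does not use clj-xpath's own API; it calls the JDK's built-in javax.xml.xpath classes through Java interop instead, so it needs no extra dependency. The namespace name and the helpers parse-doc and xpath-texts are hypothetical names made up for this illustration, and the XML is inlined as a string so the example is self-contained.

```clojure
(ns example.xpath
  (:import [java.io ByteArrayInputStream]
           [javax.xml.parsers DocumentBuilderFactory]
           [javax.xml.xpath XPath XPathConstants XPathFactory]
           [org.w3c.dom Document NodeList]))

;; The sample data from the question, inlined for a self-contained example.
(def sample-xml
  "<data>
     <products>
       <product>
         <section>Red Section</section>
         <images><image>img.jpg</image><image>img2.jpg</image></images>
       </product>
       <product>
         <section>Blue Section</section>
         <images><image>img.jpg</image><image>img3.jpg</image></images>
       </product>
       <product>
         <section>Green Section</section>
         <images><image>img.jpg</image><image>img2.jpg</image></images>
       </product>
     </products>
   </data>")

(defn parse-doc
  "Parses an XML string into a W3C DOM Document (hypothetical helper)."
  ^Document [^String s]
  (let [builder (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (.parse builder (ByteArrayInputStream. (.getBytes s "UTF-8")))))

(defn xpath-texts
  "Evaluates the XPath expression expr against doc and returns a vector
  with the text content of every matching node (hypothetical helper)."
  [^Document doc ^String expr]
  (let [^XPath xpath    (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath expr doc XPathConstants/NODESET)]
    (mapv (fn [i] (.getTextContent (.item nodes (int i))))
          (range (.getLength nodes)))))

(def doc (parse-doc sample-xml))

;; 1. Sections of all products
(xpath-texts doc "/data/products/product/section")
;; => ["Red Section" "Blue Section" "Green Section"]

;; 2. Products that contain an <image> with the text img2.jpg
(xpath-texts doc "/data/products/product[images/image='img2.jpg']/section")
;; => ["Red Section" "Green Section"]

;; 3. Products whose section is "Red Section"
(xpath-texts doc "/data/products/product[section='Red Section']/section")
;; => ["Red Section"]
```

Each query returns section names only to keep the output short; dropping the trailing /section step selects the <product> nodes themselves.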

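Answer 3 (score: 1)

The Tupelo library can easily solve problems like this using the tupelo.forest tree data structure. Please see this question for more information. API docs can be found here.

Here we load your XML data and convert it first into Enlive format and then into the native tree structure used by tupelo.forest. Libs & data def: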

(ns tst.tupelo.forest-examples
  (:use tupelo.forest tupelo.test )
  (:require
    [clojure.data.xml :as dx]
    [clojure.java.io :as io]
    [clojure.set :as cs]
    [net.cgrand.enlive-html :as en-html]
    [schema.core :as s]
    [tupelo.core :as t]
    [tupelo.string :as ts]))
(t/refer-tupelo)

(def xml-str-prod "<data>
                    <products>
                      <product>
                        <section>Red Section</section>
                        <images>
                          <image>img.jpg</image>
                          <image>img2.jpg</image>
                        </images>
                      </product>
                      <product>
                        <section>Blue Section</section>
                        <images>
                          <image>img.jpg</image>
                          <image>img3.jpg</image>
                        </images>
                      </product>
                      <product>
                        <section>Green Section</section>
                        <images>
                          <image>img.jpg</image>
                          <image>img2.jpg</image>
                        </images>
                      </product>
                    </products>
                  </data> " )

and the initialization code:

(dotest
  (with-forest (new-forest)
    (let [enlive-tree          (->> xml-str-prod
                                 java.io.StringReader.
                                 en-html/html-resource
                                 first)
          root-hid             (add-tree-enlive enlive-tree)
          tree-1               (hid->hiccup root-hid)

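The hid suffix stands for "Hex ID", a unique hex value that acts like a pointer to a node/leaf in the tree. At this stage we have just loaded the data into the forest data structure, creating tree-1, which looks like: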

[:data
 [:tupelo.forest/raw "\n                    "]
 [:products
  [:tupelo.forest/raw "\n                      "]
  [:product
   [:tupelo.forest/raw "\n                        "]
   [:section "Red Section"]
   [:tupelo.forest/raw "\n                        "]
   [:images
    [:tupelo.forest/raw "\n                          "]
    [:image "img.jpg"]
    [:tupelo.forest/raw "\n                          "]
    [:image "img2.jpg"]
    [:tupelo.forest/raw "\n                        "]]
   [:tupelo.forest/raw "\n                      "]]
  [:tupelo.forest/raw "\n                      "]
  [:product
   [:tupelo.forest/raw "\n                        "]
   [:section "Blue Section"]
   [:tupelo.forest/raw "\n                        "]
   [:images
    [:tupelo.forest/raw "\n                          "]
    [:image "img.jpg"]
    [:tupelo.forest/raw "\n                          "]
    [:image "img3.jpg"]
    [:tupelo.forest/raw "\n                        "]]
   [:tupelo.forest/raw "\n                      "]]
  [:tupelo.forest/raw "\n                      "]
  [:product
   [:tupelo.forest/raw "\n                        "]
   [:section "Green Section"]
   [:tupelo.forest/raw "\n                        "]
   [:images
    [:tupelo.forest/raw "\n                          "]
    [:image "img.jpg"]
    [:tupelo.forest/raw "\n                          "]
    [:image "img2.jpg"]
    [:tupelo.forest/raw "\n                        "]]
   [:tupelo.forest/raw "\n                      "]]
  [:tupelo.forest/raw "\n                    "]]
 [:tupelo.forest/raw "\n                   "]]

We next remove any blank strings with this code:

blank-leaf-hid?      (fn [hid] (and (leaf-hid? hid) ; ensure it is a leaf node
                                 (let [value (hid->value hid)]
                                      (and (string? value)
                                        (or (zero? (count value)) ; empty string
                                          (ts/whitespace? value)))))) ; all whitespace string

blank-leaf-hids      (keep-if blank-leaf-hid? (all-hids))
>>                   (apply remove-hid blank-leaf-hids)
tree-2               (hid->hiccup root-hid)

producing a much nicer result tree (in hiccup format):

[:data
 [:products
  [:product
   [:section "Red Section"]
   [:images [:image "img.jpg"] [:image "img2.jpg"]]]
  [:product
   [:section "Blue Section"]
   [:images [:image "img.jpg"] [:image "img3.jpg"]]]
  [:product
   [:section "Green Section"]
   [:images [:image "img.jpg"] [:image "img2.jpg"]]]]]

The following code then computes the answers to the three questions above:

product-hids         (find-hids root-hid [:** :product])
product-trees-hiccup (mapv hid->hiccup product-hids)

img2-paths           (find-paths-leaf root-hid [:data :products :product :images :image] "img2.jpg")
img2-prod-paths      (mapv #(drop-last 2 %) img2-paths)
img2-prod-hids       (mapv last img2-prod-paths)
img2-trees-hiccup    (mapv hid->hiccup img2-prod-hids)

red-sect-paths       (find-paths-leaf root-hid [:data :products :product :section] "Red Section")
red-prod-paths       (mapv #(drop-last 1 %) red-sect-paths)
red-prod-hids        (mapv last red-prod-paths)
red-trees-hiccup     (mapv hid->hiccup red-prod-hids)]

with results:

 (is= product-trees-hiccup
   [[:product
     [:section "Red Section"]
     [:images
      [:image "img.jpg"]
      [:image "img2.jpg"]]]
    [:product
     [:section "Blue Section"]
     [:images
      [:image "img.jpg"]
      [:image "img3.jpg"]]]
    [:product
     [:section "Green Section"]
     [:images
      [:image "img.jpg"]
      [:image "img2.jpg"]]]] )

(is= img2-trees-hiccup
  [[:product
    [:section "Red Section"]
    [:images
     [:image "img.jpg"]
     [:image "img2.jpg"]]]
   [:product
    [:section "Green Section"]
    [:images
     [:image "img.jpg"]
     [:image "img2.jpg"]]]])

(is= red-trees-hiccup
  [[:product
    [:section "Red Section"]
    [:images
     [:image "img.jpg"]
     [:image "img2.jpg"]]]]))))

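The complete example can be found in the forest-examples unit test.

Answer 4 (score: 0)

In many cases the thread-first macro, together with Clojure's map and vector semantics, is an adequate syntax for accessing XML. There are many cases where you need something more specific to XML (such as an XPath library), but in many others the existing language is nearly as concise and adds no dependencies.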
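
As a rough sketch of this dependency-free style applied to the three queries from the question, the example below uses only clojure.xml and the core function xml-seq. The namespace name and the helpers parse-str, elements, child-text and image-names are hypothetical names introduced for illustration, and the XML is inlined as a string rather than read from a file.

```clojure
(ns example.plain
  (:require [clojure.xml :as xml])
  (:import [java.io ByteArrayInputStream]))

;; The sample data from the question, inlined for a self-contained example.
(def sample-xml
  "<data>
     <products>
       <product>
         <section>Red Section</section>
         <images><image>img.jpg</image><image>img2.jpg</image></images>
       </product>
       <product>
         <section>Blue Section</section>
         <images><image>img.jpg</image><image>img3.jpg</image></images>
       </product>
       <product>
         <section>Green Section</section>
         <images><image>img.jpg</image><image>img2.jpg</image></images>
       </product>
     </products>
   </data>")

(defn parse-str
  "Parses an XML string with clojure.xml (hypothetical helper)."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn elements
  "All elements with the given tag anywhere below root, in document order."
  [root tag]
  (filter #(= tag (:tag %)) (xml-seq root)))

(defn child-text
  "Text of the first direct child of element that has the given tag."
  [element tag]
  (->> (:content element)
       (filter #(= tag (:tag %)))
       first
       :content
       first))

(defn image-names
  "Text of every <image> element inside product."
  [product]
  (mapcat :content (elements product :image)))

(def root (parse-str sample-xml))

;; 1. All <product> elements
(def all-products (elements root :product))

;; 2. Products that contain an <image> with the text img2.jpg
(def img2-products
  (filter #(some #{"img2.jpg"} (image-names %)) all-products))

;; 3. Products whose section is "Red Section"
(def red-products
  (filter #(= "Red Section" (child-text % :section)) all-products))

(map #(child-text % :section) img2-products)
;; => ("Red Section" "Green Section")
```

The helpers search by tag rather than by position, so they keep working if elements are reordered. When the position of the value is known in advance, plain positional navigation is shorter still: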

(pprint (-> (xml/parse "/tmp/xml") 
        :content first :content second :content first :content first))
"Blue Section"