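Converting an HTML structure into a Clojure structure

Time: 2017-08-08 17:26:05

Tags: html clojure enlive

I have an HTML page containing a structure that I want to turn into a Clojure data structure. I am trying to solve this in an idiomatic way.

Here is my structure: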

<div class="group">
  <h2>title1</h2>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading1</h3>
    <a href="path1" />
  </div>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading2</h3>
    <a href="path2" />
  </div>
</div>
<div class="group">
  <h2>title2</h2>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading3</h3>
    <a href="path3" />
  </div>
</div>

The structure I want:

'(
["Title1" "subhead1" "path1"]
["Title1" "subhead2" "path2"]
["Title2" "subhead3" "path3"]
["Title3" "subhead4" "path4"]
["Title3" "subhead5" "path5"]
["Title3" "subhead6" "path6"]
)

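The repetition of the titles is intentional.

I have looked at David Nolen's enlive tutorial. It offers a nice solution when there is parity between groups and subgroups, but in this case the number of subgroups per group can be arbitrary.

Thanks for any suggestions.

3 answers:

Answer 0 (score: 3)

You can use Hickory for parsing, and Clojure then has some very nice tools for transforming the parsed HTML into the form you want: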

(require '[hickory.core :as html])

;; Returns a predicate matching Hickory element nodes with the given tag and class.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "group"))
(def subgroup? (classifier :div "subgroup"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))

;; Returns the single item of x; fails the precondition if x has zero or several items.
(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

;; Text of the one child of element that has the given tag.
(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))

Example:

(require '[clojure.pprint :as pprint])

;; Attribute values use single quotes so they need no escaping inside the Clojure string.
(def data
"<div class='group'>
  <h2>title1</h2>
  <div class='subgroup'>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href='path1' />
  </div>
  <div class='subgroup'>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href='path2' />
  </div>
</div>
<div class='group'>
  <h2>title2</h2>
  <div class='subgroup'>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href='path3' />
  </div>
</div>")

(pprint/pprint (process data))
;; (["title1" "subheading1" "path1"]
;;  ["title1" "subheading2" "path2"]
;;  ["title2" "subheading3" "path3"])
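The classifier helper above packs three idioms into one line: juxt, comp, and a set used as a function. The following is a minimal sketch of how they combine, using only clojure.core. The nodes are hand-written maps in the shape Hickory produces, and the names signature and demo-group? are hypothetical, introduced only for this illustration.

```clojure
;; Hand-written nodes in the shape Hickory produces (assumed for illustration).
(def group-node    {:type :element :tag :div :attrs {:class "group"} :content []})
(def subgroup-node {:type :element :tag :div :attrs {:class "subgroup"} :content []})
(def link-node     {:type :element :tag :a :attrs {:href "path1"} :content nil})

;; juxt collects [type tag class] for a node into one vector.
(def signature (juxt :type :tag (comp :class :attrs)))

(signature group-node) ;; => [:element :div "group"]
(signature link-node)  ;; => [:element :a nil]

;; A set called as a function returns the matching member or nil,
;; so composing it with signature yields a predicate.
(def demo-group? (comp #{[:element :div "group"]} signature))

(demo-group? group-node)    ;; => [:element :div "group"], which is truthy
(demo-group? subgroup-node) ;; => nil
(demo-group? "\n")          ;; => nil, keyword lookups on a string return nil
```

The last call shows why the predicate can safely be run over mixed content: whitespace text nodes arrive as plain strings and simply fail to match.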

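Answer 1 (score: 0)

The solution can be split into two parts:

  • Parsing: parse the page with a Clojure HTML parser or any other parser.
  • Custom data structure: transform the parsed HTML, using clojure.walk if needed.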
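The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the parsing step is replaced by a hand-written value, parsed, in the nested :tag / :attrs / :content map shape that Enlive-style parsers return, so the sketch needs no external library. The helper names strip-blank-strings, children, text and rows are hypothetical.

```clojure
(require '[clojure.string :as str]
         '[clojure.walk :as walk])

;; Assumed parser output, written by hand for this sketch.
(def parsed
  [{:tag :div :attrs {:class "group"}
    :content ["\n  "
              {:tag :h2 :attrs nil :content ["title1"]}
              "\n  "
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading1"]}
                         {:tag :a :attrs {:href "path1"} :content nil}]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading2"]}
                         {:tag :a :attrs {:href "path2"} :content nil}]}]}
   {:tag :div :attrs {:class "group"}
    :content [{:tag :h2 :attrs nil :content ["title2"]}
              {:tag :div :attrs {:class "subgroup"}
               :content [{:tag :p :attrs nil :content ["unused"]}
                         {:tag :h3 :attrs nil :content ["subheading3"]}
                         {:tag :a :attrs {:href "path3"} :content nil}]}]}])

(defn strip-blank-strings
  "Uses clojure.walk to drop whitespace-only strings from every :content."
  [tree]
  (walk/postwalk
    (fn [x]
      (if (and (map? x) (contains? x :content))
        (update x :content #(vec (remove (every-pred string? str/blank?) %)))
        x))
    tree))

(defn children
  "Direct children of node that have the given tag."
  [tag node]
  (filter #(= tag (:tag %)) (:content node)))

(defn text
  "Concatenated string content of node."
  [node]
  (apply str (filter string? (:content node))))

(defn rows
  "One [title subheading href] vector per link in each subgroup."
  [tree]
  (for [group    (filter #(= "group" (get-in % [:attrs :class])) tree)
        subgroup (filter #(= "subgroup" (get-in % [:attrs :class])) (:content group))
        link     (children :a subgroup)]
    [(text (first (children :h2 group)))
     (text (first (children :h3 subgroup)))
     (get-in link [:attrs :href])]))

(rows (strip-blank-strings parsed))
;; => (["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"])
```

The nested for bindings are what repeat the title for every subgroup, so groups may contain any number of subgroups.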

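Answer 2 (score: 0)

You can solve this problem with the tupelo.forest library. Below is an annotated unit test that shows the approach. You can find more information in the API docs, as well as in the unit tests and the example demos. Further documentation is forthcoming.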

(dotest
  (with-forest (new-forest)
    (let [html-str        "<div class='group'>
                              <h2>title1</h2>
                              <div class='subgroup'>
                                <p>unused</p>
                                <h3>subheading1</h3>
                                <a href='path1' />
                              </div>
                              <div class='subgroup'>
                                <p>unused</p>
                                <h3>subheading2</h3>
                                <a href='path2' />
                              </div>
                            </div>
                            <div class='group'>
                              <h2>title2</h2>
                              <div class='subgroup'>
                                <p>unused</p>
                                <h3>subheading3</h3>
                                <a href='path3' />
                              </div>
                            </div>"

          enlive-tree     (->> html-str
                            java.io.StringReader.
                            en-html/html-resource
                            first)
          root-hid        (add-tree-enlive enlive-tree)
          tree-1          (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace pred fn
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >>              (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2          (hid->hiccup root-hid)
          >>              (is= tree-2 [:html
                                       [:body
                                        [:div {:class "group"}
                                         [:h2 "title1"]
                                         [:div {:class "subgroup"}
                                          [:p "unused"]
                                          [:h3 "subheading1"]
                                          [:a {:href "path1"}]]
                                         [:div {:class "subgroup"}
                                          [:p "unused"]
                                          [:h3 "subheading2"]
                                          [:a {:href "path2"}]]]
                                        [:div {:class "group"}
                                         [:h2 "title2"]
                                         [:div {:class "subgroup"}
                                          [:p "unused"]
                                          [:h3 "subheading3"]
                                          [:a {:href "path3"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths    (find-paths root-hid [:** :div :h2])
          >>              (is= (format-paths div-h2-paths)
                            [[{:tag :html}
                              [{:tag :body}
                               [{:class "group", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                             [{:tag :html}
                              [{:tag :body}
                               [{:class "group", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids        (mapv #(idx % -2) div-h2-paths)
          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths    (vec
                            (lazy-gen
                              (doseq [div-hid div-hids]
                                (let [h2-value  (find-leaf-value div-hid [:div :h2])
                                      h3-paths  (find-paths div-hid [:** :h3])
                                      h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                                  (doseq [h3-value h3-values]
                                    (yield [h2-value h3-value]))))))
          ]
      (is= dif-h3-paths
        [["title1" "subheading1"]
         ["title1" "subheading2"]
         ["title2" "subheading3"]])

      )))
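The unit test above stops at [title subheading] pairs. As a rough sketch of the same "search at any depth" idea in plain Clojure, the code below walks a hiccup-shaped tree with tree-seq and also collects each href. It does not use tupelo.forest, and the tree value plus the helpers tag, attrs, children, descendants-of, with-class, child-with-tag and rows are all hypothetical names written for this illustration.

```clojure
;; Hand-written tree in the hiccup shape shown in the test (assumed input).
(def tree
  [:html
   [:body
    [:div {:class "group"}
     [:h2 "title1"]
     [:div {:class "subgroup"}
      [:p "unused"] [:h3 "subheading1"] [:a {:href "path1"}]]
     [:div {:class "subgroup"}
      [:p "unused"] [:h3 "subheading2"] [:a {:href "path2"}]]]
    [:div {:class "group"}
     [:h2 "title2"]
     [:div {:class "subgroup"}
      [:p "unused"] [:h3 "subheading3"] [:a {:href "path3"}]]]]])

(defn tag [node] (when (vector? node) (first node)))

(defn attrs
  "Attribute map of a hiccup node, or nil when it has none."
  [node]
  (when (and (vector? node) (map? (second node)))
    (second node)))

(defn children
  "Child nodes and strings of a hiccup node, skipping the tag and attribute map."
  [node]
  (when (vector? node)
    (if (attrs node) (drop 2 node) (rest node))))

(defn descendants-of
  "The node itself and every node below it, depth first."
  [node]
  (tree-seq vector? children node))

(defn with-class [klass nodes]
  (filter #(= klass (:class (attrs %))) nodes))

(defn child-with-tag [t node]
  (first (filter #(= t (tag %)) (children node))))

(defn rows [tree]
  (for [group    (with-class "group" (descendants-of tree))
        subgroup (with-class "subgroup" (children group))
        :let [link (child-with-tag :a subgroup)]]
    [(first (children (child-with-tag :h2 group)))
     (first (children (child-with-tag :h3 subgroup)))
     (:href (attrs link))]))

(rows tree)
;; => (["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"])
```

Because the groups are found with tree-seq, they match wherever they sit in the document, which plays the role of the :** wildcard in the path query.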