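I have an HTML page containing a structure that I want to turn into a Clojure data structure. I am trying to solve this in an idiomatic way.
Here is my structure: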
<div class=“group”>
<h2>title1</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading1</h3>
<a href=“path1” />
</div>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading2</h3>
<a href=“path2” />
</div>
</div>
<div class=“group”>
<h2>title2</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading3</h3>
<a href=“path3” />
</div>
</div>
The structure I want:
'(["Title1" "subhead1" "path1"]
  ["Title1" "subhead2" "path2"]
  ["Title2" "subhead3" "path3"]
  ["Title3" "subhead4" "path4"]
  ["Title3" "subhead5" "path5"]
  ["Title3" "subhead6" "path6"])
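The repetition of the titles is intentional.
I have looked at David Nolen's Enlive tutorial. It offers a good solution when groups and subgroups pair up one-to-one, but in this case the number of subgroups per group can be arbitrary.
Thanks for any suggestions.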
Answer 0 (score: 3)
You can use Hickory for parsing, and Clojure then has some very nice tools for transforming the parsed HTML into the form you want:
(require '[hickory.core :as html])

;; Returns a predicate matching Hickory element nodes with the given tag and class.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

;; The class strings include the curly quotes because the sample HTML uses
;; “ ” instead of straight quotes, so they end up inside the attribute values.
(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))

;; Returns the sole element of x; the precondition fails unless x has exactly one.
(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

;; Returns the single content item of the single child element with the given tag.
(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))
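To make the predicate construction easier to follow, here is a minimal sketch that exercises `classifier` on hand-written maps instead of real parser output. The two sample nodes are hypothetical stand-ins that only mimic the `:type`, `:tag`, `:attrs` and `:content` keys the code above relies on, and they use plain class values without the curly quotes.

```clojure
;; Hypothetical stand-ins for Hickory element nodes (not real parser output).
(def group-node {:type :element, :tag :div, :attrs {:class "group"}, :content []})
(def link-node  {:type :element, :tag :a,   :attrs {:href "path1"},  :content nil})

;; Same definition as above: juxt extracts [type tag class] from a node,
;; and the one-element set serves as the membership test.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

((juxt :type :tag (comp :class :attrs)) group-node)
;; => [:element :div "group"]

(boolean ((classifier :div "group") group-node)) ;; => true
(boolean ((classifier :div "group") link-node))  ;; => false
(boolean ((classifier :a nil) link-node))        ;; => true
```

A set used as a function returns the matching element or nil, so the predicate returns a truthy vector rather than `true`, which is all `filter` needs.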
Example:
(require '[clojure.pprint :as pprint])
(def data
"<div class=“group”>
<h2>title1</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading1</h3>
<a href=“path1” />
</div>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading2</h3>
<a href=“path2” />
</div>
</div>
<div class=“group”>
<h2>title2</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading3</h3>
<a href=“path3” />
</div>
</div>")
(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;; ["title1" "subheading2" "“path2”"]
;; ["title2" "subheading3" "“path3”"])
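The reason this copes with a varying number of subgroups per group is the nested `for` binding: each inner collection is iterated once per outer item, and the `:let` values are repeated on every row. A minimal sketch of just that flattening step follows, assuming the data has already been parsed into plain maps; the `groups` data and the `rows` function are hypothetical names used only for illustration.

```clojure
;; Hypothetical pre-parsed data: each group has a title and any number of subgroups.
(def groups
  [{:title "title1"
    :subgroups [{:subheading "subheading1" :paths ["path1"]}
                {:subheading "subheading2" :paths ["path2"]}]}
   {:title "title2"
    :subgroups [{:subheading "subheading3" :paths ["path3"]}]}
   {:title "title3" :subgroups []}])

;; One row per path; the title and subheading are repeated on each row.
(defn rows [groups]
  (for [group groups
        :let [title (:title group)]
        subgroup (:subgroups group)
        :let [subheading (:subheading subgroup)]
        path (:paths subgroup)]
    [title subheading path]))

(rows groups)
;; => (["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"])
```

A group without subgroups, such as "title3" here, simply contributes no rows.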
Answer 1 (score: 0)
The solution can be split into two parts
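Answer 2 (score: 0)
You can solve this problem using the tupelo.forest library. Below is an annotated unit test showing the approach. You can find more information in the API docs, as well as in the unit tests and the example demos. Additional documentation is forthcoming.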
(dotest
  (with-forest (new-forest)
    (let [html-str "<div class=“group”>
                      <h2>title1</h2>
                      <div class=“subgroup”>
                        <p>unused</p>
                        <h3>subheading1</h3>
                        <a href=“path1” />
                      </div>
                      <div class=“subgroup”>
                        <p>unused</p>
                        <h3>subheading2</h3>
                        <a href=“path2” />
                      </div>
                    </div>
                    <div class=“group”>
                      <h2>title2</h2>
                      <div class=“subgroup”>
                        <p>unused</p>
                        <h3>subheading3</h3>
                        <a href=“path3” />
                      </div>
                    </div>"
          enlive-tree (->> html-str
                           java.io.StringReader.
                           en-html/html-resource
                           first)
          root-hid (add-tree-enlive enlive-tree)
          tree-1 (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace pred fn
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                   (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >> (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2 (hid->hiccup root-hid)
          >> (is= tree-2 [:html
                          [:body
                           [:div {:class "“group”"}
                            [:h2 "title1"]
                            [:div {:class "“subgroup”"}
                             [:p "unused"]
                             [:h3 "subheading1"]
                             [:a {:href "“path1”"}]]
                            [:div {:class "“subgroup”"}
                             [:p "unused"]
                             [:h3 "subheading2"]
                             [:a {:href "“path2”"}]]]
                           [:div {:class "“group”"}
                            [:h2 "title2"]
                            [:div {:class "“subgroup”"}
                             [:p "unused"]
                             [:h3 "subheading3"]
                             [:a {:href "“path3”"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths (find-paths root-hid [:** :div :h2])
          >> (is= (format-paths div-h2-paths)
                  [[{:tag :html}
                    [{:tag :body}
                     [{:class "“group”", :tag :div}
                      [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                   [{:tag :html}
                    [{:tag :body}
                     [{:class "“group”", :tag :div}
                      [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids (mapv #(idx % -2) div-h2-paths)

          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths (vec
                         (lazy-gen
                           (doseq [div-hid div-hids]
                             (let [h2-value (find-leaf-value div-hid [:div :h2])
                                   h3-paths (find-paths div-hid [:** :h3])
                                   h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                               (doseq [h3-value h3-values]
                                 (yield [h2-value h3-value]))))))]
      (is= dif-h3-paths
           [["title1" "subheading1"]
            ["title1" "subheading2"]
            ["title2" "subheading3"]]))))