Here is an example where I use html/text directly in a selector vector.
(require '[net.cgrand.enlive-html :as html])

;; Fetch a page and parse it into Enlive nodes.
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))

(defn parse-test []
  (html/select
    (fetch-url "https://news.ycombinator.com/")
    [:td.title :a html/text]))
Calling (parse-test) returns a data structure containing the Hacker News titles:
("In emergency cases a passenger was selected and thrown out of the plane. [2004]"
"“Nobody expects privacy online”: Wrong."
"The SCUMM Diary: Stories behind one of the greatest game engines ever made" ...)
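Cool!
Is it possible to end the selector vector with a custom function that returns a list of article URLs?
Something like: [:td.title :a #(str "https://news.ycombinator.com/" (:href (:attrs %)))]
Edit
Here is one way to achieve this. We can write our own select function: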
;; The last element of selector+ is a function applied to every node
;; matched by the preceding selector steps.
(defn select+ [coll selector+]
  (map
    (peek selector+)
    (html/select coll (pop selector+))))

(def href
  (fn [node] (:href (:attrs node))))

(defn parse-test []
  (select+
    (fetch-url "https://news.ycombinator.com/")
    [:td.title :a href]))

(parse-test)
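The way select+ splits its selector vector can be tried without Enlive or a network call. The following is only a minimal sketch: nodes is hand-written sample data shaped like Enlive's node maps (:tag, :attrs, :content), and abs-href is a hypothetical helper modelled on the anonymous function from the question.

```clojure
;; Hand-written sample nodes (an assumption), standing in for what
;; html/select would return for [:td.title :a].
(def nodes
  [{:tag :a :attrs {:href "item?id=1"} :content ["First story"]}
   {:tag :a :attrs {:href "item?id=2"} :content ["Second story"]}])

(def href
  (fn [node] (:href (:attrs node))))

;; Hypothetical helper: prefix the site root to a relative href.
(defn abs-href [node]
  (str "https://news.ycombinator.com/" (href node)))

(def selector+ [:td.title :a abs-href])

;; On a vector, pop drops the last element and peek returns it.
(def selector (pop selector+))   ; the plain selector [:td.title :a]
(def node-fn  (peek selector+))  ; the trailing function

;; What select+ does once the nodes have been selected.
(def links (map node-fn nodes))
```

Plain string concatenation like this only makes sense for relative hrefs; an absolute link would end up with the site root glued in front of it.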
Answer 0 (score: 2)
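As you suggested in the comments, I think it is cleanest to keep the selection of nodes separate from their transformation.
Enlive itself provides both selectors and transformers: selectors find nodes, and transformers transform them. If your intended output is HTML, you can use a combination of selectors and transformers to get the result you want.
However, since you are looking for data (a sequence of maps, perhaps?), you can skip the transformation part and just use a sequence comprehension, like this: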
(defn parse-test []
  (for [s (html/select
            (fetch-url "https://news.ycombinator.com/")
            [:td.title :a])]
    {:title (first (:content s))
     :link  (:href (:attrs s))}))

(take 2 (parse-test))
;; => ({:title " \tStartup - Bill Watterson, a cartoonist's advice ",
;;      :link "http://www.zenpencils.com/comic/128-bill-watterson-a-cartoonists-advice"}
;;     {:title "Drug Agents Use Vast Phone Trove Eclipsing N.S.A.’s",
;;      :link "http://www.nytimes.com/2013/09/02/us/drug-agents-use-vast-phone-trove-eclipsing-nsas.html?hp&_r=0&pagewanted=all"})
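The same comprehension can be tried on hand-built nodes, which also makes it easy to turn relative links into full URLs, as the question wanted. This is a sketch under assumptions: sample-nodes, base-uri and node->map are illustrative names that do not come from Enlive, and the links are resolved with java.net.URI from the JDK rather than with anything Enlive provides.

```clojure
;; Assumed base address of the scraped page.
(def base-uri (java.net.URI. "https://news.ycombinator.com/"))

;; Hand-written sample nodes (an assumption) in Enlive's node-map shape.
(def sample-nodes
  [{:tag :a :attrs {:href "item?id=1"} :content ["Relative link"]}
   {:tag :a :attrs {:href "http://www.zenpencils.com/"} :content ["Absolute link"]}])

;; Hypothetical helper: turn one node into a map of plain data.
;; URI.resolve leaves absolute links untouched and resolves relative
;; ones against base-uri.
(defn node->map [node]
  {:title (first (:content node))
   :link  (str (.resolve base-uri ^String (:href (:attrs node))))})

(def result
  (for [s sample-nodes]
    (node->map s)))
```

Note that URI.resolve throws an IllegalArgumentException for strings that are not valid URIs (for example, ones containing spaces), so real-world hrefs may need cleaning first.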