Question

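I have written a function that scrapes <h3> and <table> tags from an HTML page whose URL I pass in, and that part works as it should: I get a sequence of <h3> and <table> elements. But when I try to use the select function to extract only the table or h3 tags from the resulting sequence, I get (), and if I try to map over those tags I get (nil nil nil ...).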

Could you please help me resolve this issue, or explain what I am doing wrong?

Here is the code:

(ns Test2 
  (:require [net.cgrand.enlive-html :as html]) 
  (:require [clojure.string :as string])) 

(defn get-page 
  "Gets the html page from passed url" 
  [url] 
  (html/html-resource (java.net.URL. url))) 

(defn h3+table
  "returns sequence of <h3> and <table> tags"
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))

(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

This line is giving me a headache:

(html/select (h3+table url) [:table])

Can you tell me what I'm doing wrong?

Just to clarify my question: is it possible to use enlive's select function to extract only the table tags from the result of (h3+table url)?

Answer 1

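As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) to the raw html. It seems like you are trying to apply html/select multiple times, but this doesn't work. html/select parses html into a Clojure data structure, so you can't apply it to that data structure again.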

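I found that parsing the website is actually a little bit involved, but I thought it might be a nice use case for multimethods, so I hacked something together. Maybe this will get you started: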

(The code looks ugly here; you can also check out this gist.)

(ns tutorial.scrape1
  (:require [net.cgrand.enlive-html :as html]))

;; No earmuffs: the var is not dynamic, and *url* triggers a compiler warning on Clojure 1.3+.
(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")

(defn get-page [url] 
  (html/html-resource (java.net.URL. url))) 

(defn content->string [content]
  (cond
   (nil? content)    ""
   (string? content) content
   (map? content)    (content->string (:content content))
   (coll? content)   (apply str (map content->string content))
   :else             (str content)))

;; Ad hoc hierarchy: group the concrete classes we expect under three keywords.
(derive clojure.lang.PersistentStructMap ::Map)
(derive clojure.lang.PersistentArrayMap  ::Map)
(derive java.lang.String                 ::String)
(derive clojure.lang.ISeq                ::Collection)
(derive clojure.lang.PersistentList      ::Collection)
(derive clojure.lang.LazySeq             ::Collection)

(defn tag-type [node]
  (case (:tag node) 
   :tr    ::CompoundNode
   :table ::CompoundNode
   :th    ::TerminalNode
   :td    ::TerminalNode
   :h3    ::TerminalNode
   :tbody ::IgnoreNode
   ::IgnoreNode))

(defmulti parse-node
  (fn [node]
    (let [cls (class node)] [cls (if (isa? cls ::Map) (tag-type node) nil)])))

(defmethod parse-node [::Map ::TerminalNode] [node]
  (content->string (:content node)))
(defmethod parse-node [::Map ::CompoundNode] [node]
  (map parse-node (:content node)))
(defmethod parse-node [::Map ::IgnoreNode] [node]
  (parse-node (:content node)))
(defmethod parse-node [::String nil] [node]
  node)
(defmethod parse-node [::Collection nil] [node]
  (map parse-node node))

(defn h3+table [url] 
 (let [ws-content (get-page url)
       h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
                                            [:div#prospekt_container :table]})]
   (for [node h3s+tables] (parse-node node))))

A few words about what is going on:

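content->string takes a data structure, collects its contents into a single string, and returns it, so you can apply it to content that may still contain nested subtags (like <br/>) that you want to ignore.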
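As a small illustration of that behaviour, the sketch below repeats content->string from the answer and applies it to a hand-written node. The cell map is a made-up stand-in for what Enlive would return for a <td> containing a <br/>; it is an assumption for the example, not data from the real page.

```clojure
;; Same function as in the answer, repeated so the example stands alone.
(defn content->string [content]
  (cond
    (nil? content)    ""
    (string? content) content
    (map? content)    (content->string (:content content))
    (coll? content)   (apply str (map content->string content))
    :else             (str content)))

;; Hypothetical <td>foo<br/>bar</td> node, written by hand.
(def cell
  {:tag :td
   :attrs nil
   :content ["foo" {:tag :br :attrs nil :content nil} "bar"]})

;; The nested <br/> contributes an empty string, so only the text survives.
(content->string cell) ; => "foobar"
```

Note that the text on either side of the <br/> is joined without a separator, which may or may not be what you want for multi-line cells.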

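The derive statements establish an ad hoc hierarchy which we will later use in the multimethod parse-node. This is handy because we never quite know which data structures we are going to encounter, and we can easily add more cases later on.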
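To see what such derive statements buy you, here is a minimal sketch that can be pasted into a REPL. The two helper names (string-like? and collection-like?) are invented for this example, and deriving IPersistentVector is an addition that is not in the code above.

```clojure
;; A class may be derived from a namespace-qualified keyword.
(derive java.lang.String ::String)
(derive clojure.lang.ISeq ::Collection)
(derive clojure.lang.IPersistentVector ::Collection) ; not in the answer, added for illustration

;; Hypothetical helpers that ask the global hierarchy about a value's class.
(defn string-like? [x] (boolean (isa? (class x) ::String)))
(defn collection-like? [x] (boolean (isa? (class x) ::Collection)))

(string-like? "abc")          ; => true
(collection-like? '(1 2 3))   ; => true, PersistentList implements ISeq
(collection-like? [1 2 3])    ; => true, via IPersistentVector
(collection-like? "abc")      ; => false
```

Because isa? also looks at the interfaces a class implements, deriving an interface such as ISeq covers every concrete sequence type at once.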

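The tag-type function is really a hack that mimics the hierarchy statements. AFAIK you can't create a hierarchy out of keywords that are not namespace-qualified, so I did it this way.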

The multimethod parse-node dispatches on the class of the node and, if the node is a map, additionally on its tag-type.
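The two-part dispatch can be tried in isolation with a stripped-down sketch. The multimethod describe and its return strings are hypothetical; only the idea of dispatching on a [class tag] vector is taken from the answer.

```clojure
(derive clojure.lang.IPersistentMap ::Map)
(derive java.lang.String ::String)

;; Dispatch value is a vector: the class of the value, plus the tag if it is a map.
(defmulti describe
  (fn [x] [(class x) (when (map? x) (:tag x))]))

;; isa? is applied element-wise, so a [PersistentArrayMap :td] dispatch value
;; matches the [::Map :td] method.
(defmethod describe [::Map :td] [_] "table cell")
(defmethod describe [::Map :h3] [_] "heading")
(defmethod describe [::String nil] [_] "plain text")

(describe {:tag :td :content ["x"]}) ; => "table cell"
(describe {:tag :h3 :content ["x"]}) ; => "heading"
(describe "x")                       ; => "plain text"
```

A value with no matching method (for example a map with an unlisted tag) throws an IllegalArgumentException, which is why the answer funnels unknown tags into ::IgnoreNode.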

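Now all we have to do is define the appropriate methods: if we are at a terminal node we convert its contents to a string; otherwise we either recur on the content or map the parse-node function over the collection we are dealing with. The method for ::String is not actually used, but I left it in to be safe.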

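The h3+table function is pretty much what you had before. I simplified the selectors a bit and put them into a set; I'm not sure whether putting them into a map, as you did, works as intended.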

Happy scraping!

Answer 2

Your question is hard to understand, but I think your last line should simply be

(h3+table url)

This will return a deeply nested data structure containing the scraped HTML, which you can then dig into with the usual Clojure sequence API. Good luck.
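As a rough sketch of what digging in with the sequence API could look like, the example below filters nodes by their :tag key. The nodes vector is hand-written sample data standing in for the result of (h3+table url), and only-tag is a hypothetical helper, not part of Enlive.

```clojure
;; Hand-written stand-ins for Enlive nodes: plain maps with :tag, :attrs and :content.
(def nodes
  [{:tag :h3 :attrs nil :content ["First heading"]}
   {:tag :table :attrs nil
    :content [{:tag :tr :attrs nil
               :content [{:tag :td :attrs nil :content ["cell"]}]}]}
   {:tag :h3 :attrs nil :content ["Second heading"]}])

;; Hypothetical helper: keep only the top-level nodes with the given tag.
;; flatten unwraps nested sequences of nodes and leaves the node maps themselves intact.
(defn only-tag [tag nodes]
  (filter #(= tag (:tag %)) (flatten nodes)))

(map :tag (only-tag :table nodes)) ; => (:table)
(count (only-tag :h3 nodes))       ; => 2
```

This only looks at the top level of the result; to reach nested tags you would walk each node's :content in the same way.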

Re-parsing data with Enlive
