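URL checker in Clojure?

Time: 2009-08-10 21:21:47

Tags: clojure

I have a URL checker that I use in Perl, and I would like to know how to do the same thing in Clojure. I have a file containing thousands of URLs, and I want the output file to contain each URL (minus the http:// or https://) followed by a simple :1 for valid or :0 for invalid. Ideally, since this is one of Clojure's strengths, I could check the sites concurrently.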

Input

http://www.google.com
http://www.cnn.com
http://www.msnbc.com
http://www.abadurlisnotgood.com

Output

www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0

4 answers:

Answer 0: (score: 6)

I am assuming "valid URL" means an HTTP 200 response. This might work. It requires clojure-contrib. Change map to pmap to try to make it parallel, as Arthur Ulfeldt mentions.

(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)https?:/+" "" url)   ; strip http:// as well as https://
       ":"
       (try
        (let [c (cast HttpURLConnection
                      (.openConnection (URL. url)))]
          (if (= 200 (.getResponseCode c))
            1
            0))
        (catch UnknownHostException _
          0))))

(defn check-urls-from-file [filename]
  (doseq [line (map check-url
                    (read-lines (as-file filename)))]
    (println line)))

With your example as input:

user> (check-urls-from-file "urls.txt")
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
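
The check above only catches UnknownHostException and sets no timeouts, so a refused connection would throw and a slow host could stall the whole run. Below is a minimal sketch of a more defensive check that uses only Java interop. The name reachable? and the timeout-ms parameter are hypothetical, and sending HEAD instead of GET is an assumption: some servers answer HEAD differently, so it may need to be switched back to GET.

```clojure
(import '(java.net URL HttpURLConnection)
        '(java.io IOException))

(defn reachable?
  "Returns true if an HTTP HEAD request to url-str answers with status 200.
  timeout-ms (hypothetical parameter) is applied to both connect and read."
  ([url-str] (reachable? url-str 5000))
  ([url-str timeout-ms]
   (try
     (let [^HttpURLConnection conn (.openConnection (URL. url-str))]
       (try
         (doto conn
           (.setRequestMethod "HEAD")
           (.setConnectTimeout timeout-ms)
           (.setReadTimeout timeout-ms))
         (= 200 (.getResponseCode conn))
         (finally
           (.disconnect conn))))
     ;; MalformedURLException, UnknownHostException, ConnectException and
     ;; SocketTimeoutException are all subclasses of IOException.
     (catch IOException _
       false))))
```

A string with no protocol, such as "notanurl", returns false without any network access, because the URL constructor throws MalformedURLException.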

Answer 1: (score: 3)

Write a small function that appends ":1" or ":0" to a URL, then use pmap to apply it to all the URLs in parallel.

(defn check-a-url [url] .... )
;; urls is the sequence of URL strings read from the file
(pmap #(if (check-a-url %) (str % ":1") (str % ":0")) urls)
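
To make the idea concrete, here is a hedged sketch in which the checker is passed in as an argument. The names strip-scheme, format-result and check-all are illustrative and not part of the answer; check-a-url is assumed to return a truthy value for a valid URL.

```clojure
(require '[clojure.string :as str])

(defn strip-scheme
  "Removes a leading http:// or https:// (case-insensitive)."
  [url]
  (str/replace-first url #"(?i)^https?://" ""))

(defn format-result
  "Builds one output line, e.g. \"www.google.com:1\"."
  [url ok?]
  (str (strip-scheme url) ":" (if ok? 1 0)))

(defn check-all
  "Applies check-a-url to every URL in parallel with pmap and returns
  a vector of output lines in input order."
  [check-a-url urls]
  (vec (pmap #(format-result % (check-a-url %)) urls)))
```

A set works as a stand-in checker when trying this at the REPL, for example (check-all #{"http://www.google.com"} ["http://www.google.com" "http://www.abadurlisnotgood.com"]). In a standalone script, call (shutdown-agents) at the end; pmap runs on the same thread pool as future, and the JVM otherwise lingers for a while before exiting.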

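Answer 2: (score: 0)

I used agents with send-off, combined with the solution above, instead of pmap. I think this works better when the work is blocking I/O. I also believe pmap has limited concurrency. Here is what I have so far. I wonder how it will scale to thousands of URLs.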


(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)https?:/+" "" url)   ; strip http:// as well as https://
       ":"
       (try
        (let [c (cast HttpURLConnection
                      (.openConnection (URL. url)))]
          (if (= 200 (.getResponseCode c))
            1
            0))
        (catch UnknownHostException _
          0))))

(def urls (read-lines "urls.txt"))

(def agents (for [url urls] (agent url)))

(doseq [a agents]            ; avoid shadowing clojure.core/agent
  (send-off a check-url))

(apply await agents)

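;; Collect each agent's final state (the "host:0" or "host:1" string) in input order,
;; instead of re-def'ing a var inside a loop.
(def results (doall (map deref agents)))
(prn results)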

(shutdown-agents)
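
On scaling to thousands of URLs: send-off runs on an unbounded thread pool, so one agent per URL can start a very large number of threads at once. One possible alternative, sketched here under assumptions, is a fixed-size java.util.concurrent pool. The names check-with-pool and n-threads are hypothetical, and check-fn stands for a function like check-url above.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn check-with-pool
  "Runs (check-fn url) for every URL on a fixed pool of n-threads threads
  and returns the results as a vector in input order."
  [check-fn n-threads urls]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      (let [;; Clojure functions implement Callable, so a zero-argument
            ;; fn can be submitted to the executor directly.
            tasks   (mapv (fn [url] (fn [] (check-fn url))) urls)
            ;; invokeAll blocks until every task has finished.
            futures (.invokeAll pool tasks)]
        (mapv (fn [^Future f] (.get f)) futures))
      (finally
        (.shutdown pool)))))
```

For example, (check-with-pool check-url 50 urls) would keep at most 50 requests in flight. If check-fn throws, the corresponding .get rethrows it wrapped in an ExecutionException, so the checker should catch its own I/O errors.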

Answer 3: (score: 0)

Clojure now has an as-url function in clojure.java.io:

(as-url "http://google.com") ;;=> #object[java.net.URL 0x5dedf9bd "http://google.com"]

(str (as-url "http://google.com")) ;;=> "http://google.com"

(as-url "notanurl") ;; java.net.MalformedURLException

Based on this, we can write a function like the following:

(require '[clojure.java.io :refer [as-url reader]]
         '[clojure.string :as str])

(defn check-url
  "Checks whether the URL is well formed."
  [url]
  (str (str/replace-first url #"^https?://" "")
       ":"
       ;; as-url is built in; it performs no actual request and does very little validation
       (try (as-url url)
            1
            (catch Exception e 0))))


(defn check-urls-from-file
  "Adapted from Brian Carper's answer."
  [filename]
  ;; line-seq replaces read-lines from the old clojure.contrib.duck-streams
  (with-open [rdr (reader filename)]
    (doseq [line (map check-url (line-seq rdr))]
      (println line))))
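
Since as-url accepts almost anything that starts with a known protocol, a somewhat stricter syntactic check can be sketched with java.net.URI. The helper http-url? is hypothetical; like as-url, it performs no network request, so it says nothing about whether the site actually responds.

```clojure
(import '(java.net URI URISyntaxException))

(defn http-url?
  "Returns true if s parses as a URI whose scheme is http or https
  and which has a host component. Purely syntactic."
  [s]
  (try
    (let [uri    (URI. s)
          scheme (some-> (.getScheme uri) .toLowerCase)]
      (boolean (and (contains? #{"http" "https"} scheme)
                    (.getHost uri))))
    ;; Thrown for strings that are not valid URIs, e.g. ones containing spaces.
    (catch URISyntaxException _
      false)))
```

With this helper, "http://google.com" is accepted, while "notanurl" (no scheme) and "ftp://example.com" (wrong scheme) are rejected.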