How do I extract elements from HTML in Racket?

Asked: 2015-01-28 15:18:49

Tags: scheme racket

I want to extract the URLs from a reddit page. My code is:

#lang racket

(require net/url)
(require html)

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))

(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))

(close-input-port in)

The content-0 above is:

(element
 (location 0 0 15)
 (location 0 0 82)
...

I would like to know how to extract specific content from it.

1 answer:

Answer 0 (score: 5)

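  1. It is usually more convenient to work with HTML as x-expressions rather than as the structs of the html module.

  2. Also, you should use call/input-url so that the port is closed automatically.

  3. You can combine both by defining a read-html-as-xexpr function and using it like this: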

    #lang racket/base
    
    (require html
             net/url
             xml)
    
    (define (read-html-as-xexpr in) ;; input-port? -> xexpr?
      (caddr
       (xml->xexpr
        (element #f #f 'root '()
                 (read-html-as-xml in)))))
    
    (define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
    
    (call/input-url reddit
                    get-pure-port
                    read-html-as-xexpr)
    

    This returns one large x-expression, such as:

    '(html
      ((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
      (head
       ()
       (title () "programming: search results")
       (meta
        ((content " reddit, reddit.com, vote, comment, submit ")
         (name "keywords")))
       (meta
        ((content "reddit: the front page of the internet") (name "description")))
       (meta ((content "origin") (name "referrer")))
       (meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
    ... snip ...
    

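    How do you extract specific content?

    • For simple HTML whose overall structure I don't expect to change, I usually use match.

    • However, a more correct and more robust approach is to use the xml/path module.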
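To make the difference between the two approaches concrete, here is a minimal sketch run against a small hand-written x-expression rather than the live reddit page. The names `sample` and `page-title` are hypothetical and exist only for this illustration. The `match` version assumes the exact nesting shown, while `se-path*` searches for the element wherever it occurs.

```racket
#lang racket/base

(require racket/match
         xml/path)

;; Hypothetical stand-in for a fetched page, written by hand.
(define sample
  '(html ()
         (head () (title () "programming: search results"))
         (body () (a ((href "#content")) "jump to content"))))

;; match: works only while the page keeps exactly this shape
;; (html, then head as the first child, then title as head's only child).
(define (page-title xe)
  (match xe
    [(list 'html _ (list 'head _ (list 'title _ text)) _ ...) text]
    [_ #f]))

(page-title sample)                 ; "programming: search results"

;; xml/path: finds the first title element at any depth.
(se-path* '(title) sample)          ; "programming: search results"

;; xml/path with a keyword: collects an attribute instead of the body.
(se-path*/list '(a #:href) sample)  ; '("#content")
```

If another element were added to the head, the `match` pattern above would stop matching and return `#f`, whereas the `se-path*` call would still find the title.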



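    Update: I just noticed that your question starts out by asking about extracting URLs. Here is an updated example that uses se-path*/list to get the href attribute of every <a> element: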

    #lang racket/base
    
    (require html
             net/url
             xml
             xml/path)
    
    (define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
      (caddr
       (xml->xexpr
        (element #f #f 'root '()
                 (read-html-as-xml in)))))
    
    (define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
    
    (define xe (call/input-url reddit
                               get-pure-port
                               read-html-as-xexprs))
    
    (se-path*/list '(a #:href) xe)
    

    Result:

    '("#content"
      "http://www.reddit.com/r/announcements/"
      "http://www.reddit.com/r/Art/"
      "http://www.reddit.com/r/AskReddit/"
      "http://www.reddit.com/r/askscience/"
      "http://www.reddit.com/r/aww/"
      "http://www.reddit.com/r/blog/"
      "http://www.reddit.com/r/books/"
      "http://www.reddit.com/r/creepy/"
      "http://www.reddit.com/r/dataisbeautiful/"
      "http://www.reddit.com/r/DIY/"
      "http://www.reddit.com/r/Documentaries/"
      "http://www.reddit.com/r/EarthPorn/"
      "http://www.reddit.com/r/explainlikeimfive/"
      "http://www.reddit.com/r/Fitness/"
      "http://www.reddit.com/r/food/"
      ... snip ...
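
The result list mixes in-page anchors such as "#content" with full URLs. If only absolute links are wanted, one possible post-processing step is sketched below. The helper `absolute-urls` and the sample list `hrefs` are hypothetical, and the "/r/programming/" entry is an invented relative link used only to show the filter at work.

```racket
#lang racket/base

;; Hypothetical sample of strings like those se-path*/list returned above.
(define hrefs
  '("#content"
    "http://www.reddit.com/r/announcements/"
    "/r/programming/"
    "http://www.reddit.com/r/Art/"))

;; Keep only strings that start with http:// or https://.
(define (absolute-urls strs)
  (filter (lambda (s) (regexp-match? #rx"^https?://" s)) strs))

(absolute-urls hrefs)
```

In the full program this could be applied to the output of the se-path*/list call instead of the sample list.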