Expert F# web crawler sample

Date: 2009-06-02 13:55:50

Tags: f# ctp

I'm trying to work through the samples from Expert F#, which is based on v1.9.2, but the CTP releases since then have changed enough that they no longer compile.

I'm having trouble with Listing 13-13. Here is a fragment of the urlCollector object definition:

let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { if visited.Count < limit then
                        let! url = self.Receive()
                        if not (visited.Contains(url)) then
                            do! Async.Start
                                (async { let! links = collectLinks url
                                         for link in links do
                                         do self <-- link })

                        return! waitForUrl(visited.Add(url)) }

            waitForUrl(Set.Empty))

I'm compiling with version 1.9.6.16, and the compiler complains as follows:

  1. incomplete structured construct at or before this point in expression [after the last paren]
  2. error in the return expression for this 'let'. Possible incorrect indentation [pointing at the definition of waitForUrl]

Can anyone spot what's going wrong here?

1 Answer:

Answer 0 (score: 3)

It looks like the last line needs to be indented four more spaces.
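The underlying rule is F#'s offside (lightweight syntax) rule: the expression that follows a `let rec` binding must start in the same column as the `let rec` itself, not further right or left. A minimal, illustrative sketch of the rule (not the book's code; names are made up):

```fsharp
// Illustrative only: the call after the `let rec` binding must be
// aligned with the `let rec` to be parsed as the next expression
// in the function body.
let run () =
    let rec loop n =
        if n = 0 then "done" else loop (n - 1)
    loop 3    // aligned with `let rec` above: compiles fine
printfn "%s" (run ())
```

In the snippet from the question, `waitForUrl(Set.Empty)` is mis-aligned relative to the `let rec waitForUrl` binding, which is why the compiler reports "possible incorrect indentation" against that definition.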

Edit: actually, it looks like there's more to it than that. Assuming this is the same sample as here, this is a version I just tweaked to work with version 1.9.6.16:

open System.Collections.Generic
open System.Net
open System.IO
open System.Threading
open System.Text.RegularExpressions

let limit = 10    

let linkPat = "href=\s*\"[^\"h]*(http://[^&\"]*)\""
let getLinks (txt:string) =
    [ for m in Regex.Matches(txt,linkPat)  -> m.Groups.Item(1).Value ]

let (<--) (mp: MailboxProcessor<_>) x = mp.Post(x)

// A type that helps limit the number of active web requests
type RequestGate(n:int) =
    let semaphore = new Semaphore(initialCount=n,maximumCount=n)
    member x.AcquireAsync(?timeout) =
        async { let! ok = semaphore.AsyncWaitOne(?millisecondsTimeout=timeout)
                if ok then
                   return
                     { new System.IDisposable with
                         member x.Dispose() =
                             semaphore.Release() |> ignore }
                else
                   return! failwith "couldn't acquire a semaphore" }

// Gate the number of active web requests
let webRequestGate = RequestGate(5)

// Fetch the URL, and post the results to the urlCollector.
let collectLinks (url:string) =
    async { // An Async web request with a global gate
            let! html =
                async { // Acquire an entry in the webRequestGate. Release
                        // it when 'holder' goes out of scope
                        use! holder = webRequestGate.AcquireAsync()

                        // Wait for the WebResponse
                        let req = WebRequest.Create(url,Timeout=5)

                        use! response = req.AsyncGetResponse()

                        // Get the response stream
                        use reader = new StreamReader(
                            response.GetResponseStream())

                        // Read the response stream
                        return! reader.AsyncReadToEnd()  }

            // Compute the links, synchronously
            let links = getLinks html

            // Report, synchronously
            do printfn "finished reading %s, got %d links" 
                    url (List.length links)

            // We're done
            return links }

let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { if visited.Count < limit then
                        let! url = self.Receive()
                        if not (visited.Contains(url)) then
                            Async.Start 
                                (async { let! links = collectLinks url
                                         for link in links do
                                             do self <-- link })
                        return! waitForUrl(visited.Add(url)) }

        waitForUrl(Set.Empty))

urlCollector <-- "http://news.google.com"
// wait for keypress to end program
System.Console.ReadKey() |> ignore
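For reference, the agent half of the sample boils down to one pattern: a MailboxProcessor whose body is a tail-recursive async loop, threading its state (here, the visited set) through the recursion, with messages delivered via Post (wrapped by the `<--` operator above). A stripped-down sketch of that pattern, with illustrative names that are not from the book:

```fsharp
// Minimal agent sketch: receive a message, act on it, recurse with
// updated state. Tail recursion via return! keeps the loop alive
// without growing the stack.
let echoAgent =
    MailboxProcessor.Start(fun inbox ->
        let rec loop count =
            async { let! msg = inbox.Receive()
                    printfn "%d: %s" count msg
                    return! loop (count + 1) }
        loop 0)

echoAgent.Post "first"
echoAgent.Post "second"
```

Note that `loop 0` is aligned with the `let rec` binding, exactly the indentation issue the question ran into.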