Question

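I'm building a crawler that fetches a URL, extracts the links from it, and visits each of them up to a certain depth, building a tree of paths on a particular site.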

The way I implemented parallelism for this crawler is that I visit each new URL as soon as it is found, like this:

func main() {
    link := "https://example.com"

    wg := new(sync.WaitGroup)
    wg.Add(1)

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    wg.Wait()
}

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    for link := range ch {
        // seen is a global variable that holds all seen URLs
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // handle the link and create a variable "links" containing the links found inside the page
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}

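This works fine for relatively small sites, but when I run it on a large one, I start getting one of the following two errors on some requests: socket: too many open files and no such host (the host is definitely there).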

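What is the best way to handle this? Should I check for these errors and pause execution until other requests have finished? Or should I cap the number of requests in flight at any given time? (That makes more sense to me, but I'm not sure exactly how to code it.)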

Answer 1

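The "files" referred to in the error socket: too many open files are file descriptors, and on Unix-like systems these include sockets (here, the HTTP requests that load the pages being scraped).

The DNS lookup most likely fails for the same reason, because it cannot open a file either, but that failure is reported as no such host.

The problem can be solved in two ways: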

1) Increase the maximum number of open file handles
2) Limit the maximum number of concurrent `crawl` calls

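1) This is the simplest solution, but it may not be ideal, because it only postpones the problem until you hit a site with more links than the new limit allows. On Linux, this limit can be set with ulimit -n.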
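To see which limit the crawler is actually running under, the process can query it at startup. The following is a minimal sketch, assuming a Unix-like system where Go's syscall package exposes Getrlimit and RLIMIT_NOFILE; the helper name openFileLimit is made up for this illustration.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit (hypothetical helper) returns the soft and hard limits on
// open file descriptors for the current process. Unix-like systems only.
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	// The field types differ between platforms, so convert explicitly.
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("could not read the limit:", err)
		return
	}
	fmt.Printf("soft limit: %d, hard limit: %d\n", soft, hard)
}
```

The soft limit is the one that triggers too many open files, so any cap on concurrent requests should stay comfortably below it.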

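2) This is more of a design question. We need to limit the number of HTTP requests that can be in flight at the same time. I have modified the code a little. The most important change is maxGoRoutines. Every crawl call that starts inserts a value into the channel. Once the channel is full, the next call blocks until a value is removed from the channel. Every time a crawl call finishes, it removes a value from the channel.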

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    link := "https://example.com"

    wg := new(sync.WaitGroup)
    wg.Add(1)

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    fmt.Println("waiting")
    wg.Wait()
}

//This is the maximum number of concurrent scraping calls running
var MaxCount = 100
var maxGoRoutines = make(chan struct{}, MaxCount)

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    // seen holds every URL that has already been handed to crawl
    seen := make(map[string]bool)
    for link := range ch {
        if seen[link] {
            // Duplicate: release the count that was added when it was queued
            wg.Done()
            continue
        }
        seen[link] = true
        // The count added when this link was queued is released by crawl
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // Releases this link's count, so that main can quit once all requests are done
    defer wg.Done()

    links := doCrawl(link)

    // Count every link before queueing it, as in the question's code
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}

func doCrawl(link string) []string {
    //This limits the maximum number of concurrent scraping requests
    maxGoRoutines <- struct{}{}
    defer func() { <-maxGoRoutines }()

    // handle the link and create a variable "links" containing the links found inside the page
    time.Sleep(time.Second)
    return []string{link + "a", link + "b"}
}

What is the best way to handle "too many open files"?
