Overwritten object

Time: 2017-02-08 12:28:22

Tags: go

I'm learning Go by writing a web spider. I'm trying to fetch a list of all business categories from allpages.com.

Below is my entire program. Unfortunately I can't isolate the problem, so I've pasted all of it.

If you run this program, you'll first see that it correctly downloads the first page and appends all the extracted categories to the list of categories.

However, when it then downloads the subsequent pages, it seems to mix up the references to the parent categories. For example, it wrongly computes the URL http://www.allpages.com/travel-tourism/political-ideological-organizations/, when in fact political-ideological-organizations/ is not a subcategory of travel-tourism/. Digging through the logs, it looks as if the data in the parent object is being overwritten. The more workers there are, the more pronounced the error becomes.

This worked somewhat better before I started passing data to the goroutines by reference, but the problem was essentially the same.

I have a couple of questions:

  1. How can I debug this, other than sifting through log lines?
  2. What is going wrong / why doesn't it work, and how can I fix it?

    package main
    
    import (
            "fmt"
            "github.com/PuerkitoBio/goquery"
            "log"
            "strconv"
            "strings"
            "regexp"
    )
    
    const domain = "http://www.allpages.com/"
    const categoryPage = "category.html"
    
    type Category struct {
            url string
            level uint
            name string
            entries int
            parent *Category
    }
    
    type DownloadResult struct {
            doc *goquery.Document
            category *Category
    }
    
    const WORKERS = 2
    const SEPARATOR = "§§§"
    
    func main() {
    
            allCategories := make([]Category, 0)
    
            downloadChannel := make(chan *Category)
            resultsChannel := make(chan *DownloadResult, 100)
    
            for w := 1; w <= WORKERS; w++ {
                    go worker(downloadChannel, resultsChannel)
            }
    
            numRequests := 1
            downloadChannel <- &Category{ domain + categoryPage, 0, "root", 0, nil }
    
            for result := range resultsChannel {
                    var extractor func(doc *goquery.Document) []string
    
                    if result.category.level == 0 {
                            extractor = topLevelExtractor
                    } else if result.category.level == 1 {
                            extractor = secondLevelExtractor
                    } else {
                            extractor = thirdLevelExtractor
                    }
    
                    categories := extractCategories(result.doc, result.category, extractor)
                    allCategories = append(allCategories, *categories...)
    
                    //fmt.Printf("Appending categories: %v", *categories)
    
                    fmt.Printf("total categories = %d, total requests = %d\n", len(allCategories), numRequests)
    
                    for _, category := range *categories {
                            numRequests += 1
                            downloadChannel <- &category
                    }
    
                    // close the channels when there are no more jobs
                    if len(allCategories) > numRequests {
                            close(downloadChannel)
                            close(resultsChannel)
                    }
            }
    
            fmt.Println("Done")
    }
    
    func worker(downloadChannel <-chan *Category, results chan<- *DownloadResult) {
            for target := range downloadChannel {
                    fmt.Printf("Downloading %v (addr %p) ...", target, &target)
    
                    doc, err := goquery.NewDocument(target.url)
                    if err != nil {
                            log.Fatal(err)
                            panic(err)
                    }
    
                    fmt.Print("done \n")
    
                    results <- &DownloadResult{doc, target}
            }
    }
    
    func extractCategories(doc *goquery.Document, parent *Category, extractor func(doc *goquery.Document) []string) *[]Category {
    
            numberRegex, _ := regexp.Compile("[0-9,]+")
    
            log.Printf("Extracting subcategories for page %s\n", parent)
    
            subCategories := extractor(doc)
    
            categories := make([]Category, 0)
    
            for _, subCategory := range subCategories {
                    log.Printf("Got subcategory=%s from parent=%s", subCategory, parent)
                    extracted := strings.Split(subCategory, SEPARATOR)
    
                    numberWithComma := numberRegex.FindString(extracted[2])
                    number := strings.Replace(numberWithComma, ",", "", -1)
    
                    numRecords, err := strconv.Atoi(number)
                    if err != nil {
                            log.Fatal(err)
                            panic(err)
                    }
    
                    var category Category
    
                    level := parent.level + 1
    
                    if parent.level == 0 {
                            category = Category{ domain + extracted[1], level, extracted[0], numRecords, parent }
                    } else {
                            log.Printf("category URL=%s, parent=%s, parent=%v", extracted[1], parent.url, parent)
                            category = Category{ parent.url + extracted[1], level, extracted[0], numRecords, parent }
                    }
    
                    log.Printf("Appending category=%v (pointer=%p)", category, &category)
    
                    categories = append(categories, category)
            }
    
            return &categories
    }
    
    func topLevelExtractor(doc *goquery.Document) []string {
            return doc.Find(".cat-listings-td .c-1s-2m-1-td1").Map(func(i int, s *goquery.Selection) string {
                    title := s.Find("a").Text()
                    url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
                            v, _ := a.Attr("href")
                            return v
                    })
                    records := s.Clone().Children().Remove().End().Text()
    
                    //log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)
    
                    res := []string{title, url[0], records}
                    return strings.Join(res, SEPARATOR)
            })
    }
    
    func secondLevelExtractor(doc *goquery.Document) []string {
            return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
                    title := s.Find("a").Text()
                    url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
                            v, _ := a.Attr("href")
                            return v
                    })
                    records := s.Clone().Children().Remove().End().Text()
    
                    //log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)
    
                    res := []string{title, url[0], records}
                    return strings.Join(res, SEPARATOR)
            })
    }
    
    func thirdLevelExtractor(doc *goquery.Document) []string {
            return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
                    title := s.Find("a").Text()
                    url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
                            v, _ := a.Attr("href")
                            return v
                    })
                    records := s.Clone().Children().Remove().End().Text()
    
                    //log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)
    
                    res := []string{title, url[0], records}
                    return strings.Join(res, SEPARATOR)
            })
    }
    
UPDATE: Fixed - see the comment below.

1 Answer:

Answer 0 (score: 0)

The loop:

    for _, category := range *categories {
            numRequests += 1
            downloadChannel <- &category
    }

means that I was sending the channel a reference to the temporary loop variable category, rather than the actual memory address of each value.
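As a side illustration (not part of the original answer), here is a minimal standalone sketch of the same pitfall: under Go versions prior to 1.22, a range loop reuses a single loop variable, so taking its address on every iteration produces pointers that all alias the same memory.

    package main

    import "fmt"

    func main() {
            categories := []string{"travel-tourism", "health", "finance"}

            var ptrs []*string
            for _, category := range categories {
                    // Before Go 1.22, "category" is one variable reused on every
                    // iteration, so every appended pointer refers to the same slot.
                    ptrs = append(ptrs, &category)
            }

            for _, p := range ptrs {
                    fmt.Println(*p) // typically prints "finance" three times pre-Go 1.22
            }
    }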

I solved this by using a different loop:

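A minimal sketch of one such loop (an assumption, not necessarily the answer's exact code): index into the slice so each send takes the address of a distinct slice element instead of the address of the reused loop variable.

    for i := range *categories {
            numRequests += 1
            // &(*categories)[i] points at a separate element for each iteration
            downloadChannel <- &(*categories)[i]
    }

With this change, each worker receives a pointer to its own Category element, rather than several pointers that all alias a single loop variable whose contents keep changing.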