I'm learning by writing a web spider. I'm trying to fetch a list of all business categories from allpages.com.
Below is my entire program. Unfortunately I couldn't isolate the problem, so I've pasted all of it.
If you run the program, you'll see that it first downloads the first page correctly and appends all the extracted categories to the category list.
However, when it then downloads the subsequent pages, it seems to mangle the reference to the parent category. For example, it wrongly computes the URL http://www.allpages.com/travel-tourism/political-ideological-organizations/, even though political-ideological-organizations/ is not a subcategory of travel-tourism/. Digging through the logs, the data in the parent object appears to get overwritten. The more workers there are, the more pronounced the error becomes.
This worked somewhat better before I started passing data to the goroutines by reference, but the problem was essentially the same.
My question: what is going wrong / why doesn't it work, and how do I fix it?
package main

import (
	"fmt"
	"log"
	"regexp"
	"strconv"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const domain = "http://www.allpages.com/"
const categoryPage = "category.html"

type Category struct {
	url     string
	level   uint
	name    string
	entries int
	parent  *Category
}

type DownloadResult struct {
	doc      *goquery.Document
	category *Category
}

const WORKERS = 2
const SEPARATOR = "§§§"

func main() {
	allCategories := make([]Category, 0)

	downloadChannel := make(chan *Category)
	resultsChannel := make(chan *DownloadResult, 100)

	for w := 1; w <= WORKERS; w++ {
		go worker(downloadChannel, resultsChannel)
	}

	numRequests := 1
	downloadChannel <- &Category{domain + categoryPage, 0, "root", 0, nil}

	for result := range resultsChannel {
		var extractor func(doc *goquery.Document) []string

		if result.category.level == 0 {
			extractor = topLevelExtractor
		} else if result.category.level == 1 {
			extractor = secondLevelExtractor
		} else {
			extractor = thirdLevelExtractor
		}

		categories := extractCategories(result.doc, result.category, extractor)
		allCategories = append(allCategories, *categories...)

		//fmt.Printf("Appending categories: %v", *categories)

		fmt.Printf("total categories = %d, total requests = %d\n", len(allCategories), numRequests)

		for _, category := range *categories {
			numRequests += 1
			downloadChannel <- &category
		}

		// close the channels when there are no more jobs
		if len(allCategories) > numRequests {
			close(downloadChannel)
			close(resultsChannel)
		}
	}

	fmt.Println("Done")
}

func worker(downloadChannel <-chan *Category, results chan<- *DownloadResult) {
	for target := range downloadChannel {
		fmt.Printf("Downloading %v (addr %p) ...", target, &target)

		doc, err := goquery.NewDocument(target.url)
		if err != nil {
			log.Fatal(err)
		}

		fmt.Print("done \n")

		results <- &DownloadResult{doc, target}
	}
}

func extractCategories(doc *goquery.Document, parent *Category, extractor func(doc *goquery.Document) []string) *[]Category {
	numberRegex, _ := regexp.Compile("[0-9,]+")

	log.Printf("Extracting subcategories for page %s\n", parent)
	subCategories := extractor(doc)

	categories := make([]Category, 0)

	for _, subCategory := range subCategories {
		log.Printf("Got subcategory=%s from parent=%s", subCategory, parent)
		extracted := strings.Split(subCategory, SEPARATOR)

		numberWithComma := numberRegex.FindString(extracted[2])
		number := strings.Replace(numberWithComma, ",", "", -1)

		numRecords, err := strconv.Atoi(number)
		if err != nil {
			log.Fatal(err)
		}

		var category Category
		level := parent.level + 1
		if parent.level == 0 {
			category = Category{domain + extracted[1], level, extracted[0], numRecords, parent}
		} else {
			log.Printf("category URL=%s, parent=%s, parent=%v", extracted[1], parent.url, parent)
			category = Category{parent.url + extracted[1], level, extracted[0], numRecords, parent}
		}

		log.Printf("Appending category=%v (pointer=%p)", category, &category)
		categories = append(categories, category)
	}

	return &categories
}

func topLevelExtractor(doc *goquery.Document) []string {
	return doc.Find(".cat-listings-td .c-1s-2m-1-td1").Map(func(i int, s *goquery.Selection) string {
		title := s.Find("a").Text()
		url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
			v, _ := a.Attr("href")
			return v
		})
		records := s.Clone().Children().Remove().End().Text()

		//log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)
		res := []string{title, url[0], records}
		return strings.Join(res, SEPARATOR)
	})
}

func secondLevelExtractor(doc *goquery.Document) []string {
	return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
		title := s.Find("a").Text()
		url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
			v, _ := a.Attr("href")
			return v
		})
		records := s.Clone().Children().Remove().End().Text()

		//log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)
		res := []string{title, url[0], records}
		return strings.Join(res, SEPARATOR)
	})
}

func thirdLevelExtractor(doc *goquery.Document) []string {
	return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
		title := s.Find("a").Text()
		url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
			v, _ := a.Attr("href")
			return v
		})
		records := s.Clone().Children().Remove().End().Text()

		//log.Printf("Item %d: %s, %s - %s\n", i, title, records, url)
		res := []string{title, url[0], records}
		return strings.Join(res, SEPARATOR)
	})
}
UPDATE: Fixed - see the answer below.
Answer (score: 0)
The loop:
for _, category := range *categories {
	numRequests += 1
	downloadChannel <- &category
}

meant that I was sending the channel a reference to the temporary loop variable category, rather than the actual memory address of each value. I solved this by using a different loop.