Question

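I'm working on a small web scraper just to get a feel for Go. It currently grabs information from a table on a wiki, and then specifically from the table's cells. I don't have the code on me right now (I'm not at home), but it looks fairly similar to this: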

    func main() {
        doc, err := goquery.NewDocument("http://monsterhunter.wikia.com/wiki/MH4:_Item_List")
        if err != nil {
            log.Fatal(err)
        }

        doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
            title := s.Find("td").Text()
            fmt.Printf(title)
        })
    }

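The issue is that on this site the first cell of each row is an image, so the scraper prints the image source, which I don't want. How can I ignore the first cell in every row of the large table?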

Answer 1

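Let's clear some things up. A Selection is a collection of nodes matching some criteria.

doc.Find() is really Selection.Find() (a Document embeds a *Selection), and it returns a new Selection containing the elements that match the criteria. Selection.Each() iterates over each element of the collection and calls the function value passed to it.

So in your case Find("tbody") finds all tbody elements, and Each() iterates over all of them and calls your anonymous function for each one.

Inside your anonymous function, s is a Selection holding a single tbody element. You call s.Find("td"), which returns a new Selection containing all the td elements of the current table. So when you call Text() on it, you get the combined text content of every td element, including their descendants. This is not what you want.

What you should do is call another Each() on the Selection returned by s.Find("td"), and check whether the Selection passed to the second anonymous function contains an img element.

Example code: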

    doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
        // s here is a tbody element
        s.Find("td").Each(func(j int, s2 *goquery.Selection) {
            // s2 here is a td element
            if s2.Find("img").Length() > 0 {
                return // This td has at least one img descendant, skip it
            }
            // Print, not Printf: the cell text is data, not a format string
            fmt.Print(s2.Text())
        })
    })

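Alternatively, you could search for tr elements and skip the first td child of each row by checking whether the index passed to the third anonymous function is 0 (the first child), something like this: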

    doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
        // s here is a tbody element
        s.Find("tr").Each(func(j int, s2 *goquery.Selection) {
            // s2 here is a tr element
            s2.Find("td").Each(func(k int, s3 *goquery.Selection) {
                // s3 here is a td element
                if k == 0 {
                    return // This is the first td in the row
                }
                // Print, not Printf: the cell text is data, not a format string
                fmt.Print(s3.Text())
            })
        })
    })
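
The row-and-index logic above can be tried offline with nothing but the standard library. What follows is only a sketch under stated assumptions: it swaps goquery for `encoding/xml`, so it works on well-formed markup only (real wiki pages usually are not well-formed XML, which is why goquery stays the right tool for the actual scraper). The `sample` table and the `rowsWithoutFirstCell` helper are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sample is a hypothetical, well-formed stand-in for the wiki table.
const sample = `<table><tbody>
<tr><td><img src="potion.png"/></td><td>Potion</td><td>Restores health</td></tr>
<tr><td><img src="herb.png"/></td><td>Herb</td><td>Ingredient</td></tr>
</tbody></table>`

// rowsWithoutFirstCell returns, for each <tr>, the text of every <td>
// except the first one. Nested tables are not handled in this sketch.
func rowsWithoutFirstCell(markup string) ([][]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	var rows [][]string
	var text strings.Builder
	cellIndex := -1 // index of the current <td> within its row
	inCell := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "tr":
				rows = append(rows, nil) // start a new row
				cellIndex = -1           // the cell counter restarts per row
			case "td":
				cellIndex++
				inCell = true
				text.Reset()
			}
		case xml.CharData:
			if inCell {
				text.Write(t) // collects text of the cell and its descendants
			}
		case xml.EndElement:
			if t.Name.Local == "td" {
				inCell = false
				// Index 0 is the image cell, so it is dropped.
				if cellIndex > 0 && len(rows) > 0 {
					last := len(rows) - 1
					rows[last] = append(rows[last], strings.TrimSpace(text.String()))
				}
			}
		}
	}
}

func main() {
	rows, err := rowsWithoutFirstCell(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, row := range rows {
		fmt.Println(strings.Join(row, " | "))
	}
	// Prints:
	// Potion | Restores health
	// Herb | Ingredient
}
```

Resetting `cellIndex` at every `<tr>` plays the same role as the index `k` in the nested `Each()` calls, which starts again from 0 for each row's own Selection of td elements.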

Golang web scraper: ignoring specific cells of a table
