Question

I'm using Apify (a headless browser service) to write web-scraping crawlers, which are written in JavaScript.

I'm trying to collect the article content of several hundred articles I have published on a blog.

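The crawler is configured in Apify's web interface by specifying a start page, list pages (the paginated indexes that contain links to the articles), and a URL pattern for the target articles that should be crawled from there.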

The values I chose...

Start: https://www.example.com/author/myname
List: https://www.example.com/author/myname/page/[\d+]
Detail: https://www.example.com/[\d+]/[\d+]/[a-z0-9]+(?:-[a-z0-9]+)*.html$

Here is the crawler code...

function pageFunction(context) {

    // Called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;

    // If page is START or a LIST,
    if (context.request.label === 'START' || context.request.label === 'LIST') {

        context.skipOutput();

        // First, gather LIST page
        $('a.page-numbers').each(function() {
            context.enqueuePage({
                url: /*window.location.origin +*/ $(this).attr('href'),
                label: 'LIST'
            });
        });

        // Then, gather every DETAIL page
        $('h3>a').each(function(){
            context.enqueuePage({
                url: /*window.location.origin +*/ $(this).attr('href'),
                label: 'DETAIL'
            });
        });

    // If page is actually a DETAIL target page
    } else if (context.request.label === 'DETAIL') {

        result = {
            "title": $('h1')
        };

    }
    return result;
}

I think this structure is probably correct.

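On START and LIST pages, this correctly identifies the URLs to crawl, so that is not the problem. I have verified that pageFunction fires for every page from which data should be extracted. As a test, I am only trying to extract the H1 tag from each page.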

The problem is that for every crawled page (that is, every time pageFunction executes), the crawler does not return the H1 tag but instead returns...

Error invoking user-provided 'pageFunction': Error: TypeError: JSON.stringify cannot serialize cyclic structures.

I have read about JSON.stringify, but I don't fully understand it.
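
To make the error message concrete, here is a minimal sketch in plain JavaScript, independent of Apify, of what a "cyclic structure" is and how JSON.stringify reacts to one. The `node` object and the `trySerialize` helper are hypothetical names used only for illustration. Whether this is what actually happens inside pageFunction is an assumption on my part: the object returned by `$('h1')` is a jQuery object wrapping DOM nodes, which reference one another, so it may well be cyclic, whereas something like `$('h1').text()` would be a plain string.

```javascript
// A cyclic structure: an object that directly or indirectly references itself.
var node = { name: 'h1' };
node.self = node; // this back-reference creates the cycle

// Hypothetical helper: attempt to serialize and report the outcome
// instead of letting the exception propagate.
function trySerialize(value) {
    try {
        return JSON.stringify(value);
    } catch (err) {
        return 'failed: ' + err.name;
    }
}

// The cyclic object cannot be serialized; JSON.stringify throws a TypeError.
var cyclicOutcome = trySerialize(node);

// An object holding only plain values (strings, numbers, ...) serializes fine.
var plainOutcome = trySerialize({ title: 'Hello' });

console.log(cyclicOutcome);
console.log(plainOutcome);
```

If that assumption holds, the result object would need to contain only plain values for the crawler to be able to serialize it.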


0 answers: