Question

In JS using Cheerio, for this block, where "*" is dynamic text (e.g. an ID #):

<a class="article-link*" href="https://www.somedomain.com">

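How do I extract the URL?

Is it possible to use a wildcard to match only the leading part of a class name and still get the element? I tried: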

$("[class = 'article-link']*") 
//fails probably because the string is terminated prematurely

$("*[class = 'article-link]*")
//malformed attribute (obviously, but thought I'd give it a whack)

$("*[class = 'article-link*']")
//fails (again, obviously)

$("*[class = 'article-link\*']")
//I was trying to escape the string, but I believe cheerio encapsulates the break character as part of the string because it's inside of [] - and idk if the wildcard can even be used this way

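FYI - I am able to use a wildcard like this to get another element where what comes before the attribute is not always the same (itemprop in this case), e.g. when it sits on different heading tags: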

var titleElem = $("*[itemprop = 'title']").get()
//gets [itemprop = 'title'] regardless of previous tag(s)

Answer 1

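If the dynamic text is generated by JavaScript, you won't be able to access it through cheerio, because cheerio is only a DOM parser and does not run scripts.

If that is the case and you need to simulate what a browser does, you could look at this information or PhantomJS.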

Answer 2

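The problem with request is that it cannot execute JavaScript, so it never sees data rendered by scripts. Try a headless browser instead. Nightmare is a great one.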

npm install nightmare --save

You make the call with a Nightmare instance, then pass the resulting HTML to cheerio. Here is a sample:

const Nightmare = require('nightmare');
const cheerio = require('cheerio');

const nightmare = Nightmare({ show: true });
const url = 'https://www.somedomain.com'; // placeholder: the page to scrape

nightmare
  .goto(url)
  // do something in the chain to go to your desired page.
  .evaluate(() => document.querySelector('body').outerHTML)
  .then(function (html) {
    // load the rendered HTML into cheerio and keep the returned query function
    const $ = cheerio.load(html);
    // do something in cheerio, perhaps something like:
    const links = $("a[class^='article-link']").map(function (i, element) {
      return $(element).attr('href');
    }).toArray();

    console.log(links); // => [link1, link2, ...]
  })
  .catch(function (error) {
    console.error('Error:', error);
  });

Answer 3

The way I accessed it was:

const cheerio = require('cheerio');
// html is the page markup as a string, fetched beforehand
const $ = cheerio.load(html);

// article is the div directly above this link, list-wrapper is the div above that, and a is the link itself
const rows = $('.list-wrapper article a');

// .attr reads an attribute of the first matched element
const url = rows.attr('href').trim();

I had other elements to grab from this class; otherwise I would have done it in one line:

const url = $('.list-wrapper article a').attr('href').trim();

js / cheerio: get a URL using a dynamic name in the class
