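I created a scraper for Yelp. It works well, but it is slow. First I load the results page, then collect all the listing URLs, then visit each listing to gather the company details.
I am new to JavaScript and cheerio, so maybe I am doing something wrong.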
// Dependencies. listings, details, pageCounter, pageLimit and exportResults
// are assumed to be defined elsewhere in the script.
const axios = require('axios');
const cheerio = require('cheerio');
const chalk = require('chalk');

// Main function for scraping
const getDetails = async (url) => {
  try {
    // Fetch the results page
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    // Collect the link of each listing
    $('div.businessName__373c0__1fTgn').each((i, element) => {
      const $element = $(element);
      const $l = $element.find('h3').children('a').attr('href');
      const $listing = `https://www.yelp.com${$l}`;
      console.log(`found ${$listing}`);
      listings.push($listing);
    });
    // Find the next page link and clean it up
    const nextPageLink = $('.pagination-link--current__373c0__37ym9')
      .parent().parent().parent().next('div').find('a').attr('href');
    const page = 'https://www.yelp.com' + nextPageLink;
    console.log(chalk.cyan(` Scraping: ${page}`));
    pageCounter++;
    // When pageCounter reaches pageLimit, start scraping for company info
    if (pageCounter === pageLimit) {
      await scrapeDetailsPage(listings);
      return false;
    }
    // Pass in the next page link
    await getDetails(page);
  } catch (error) {
    // On failure, still scrape whatever listings were collected so far
    await scrapeDetailsPage(listings);
    console.log(error);
  }
};
// Scrape details page
const scrapeDetailsPage = async (listings) => {
  // Iterate through each listing link and scrape its page, one at a time
  for (let i = 0; i < listings.length; i++) {
    // Current listing URL
    const detailsLink = listings[i];
    // Scrape the company details
    try {
      const response = await axios.get(detailsLink);
      const $ = cheerio.load(response.data);
      const $name = $('.biz-page-title').text();
      const $category = $('.category-str-list').text().replace('\n', '');
      const $phone = $('.biz-phone').text().replace('\n', '');
      const $website = $('.biz-website').find('a').text().replace('\n', '');
      const $websiteLink = $('.biz-website').find('a').attr('href');
      const $address = $('address').text().replace('\n', '');
      console.log(`Getting ${$name} details`);
      // Record the company info in an object
      const data = {
        name: $name,
        category: $category,
        phone: $phone,
        website: $website,
        address: $address,
      };
      details.push(data);
      // After the last listing, export the details array to a JSON file
      if (listings.length - 1 === i) {
        exportResults(details);
        console.log(`There are ${details.length} records`);
        return false;
      }
    } catch (error) {
      // On any failure, log it, export what was collected so far, and stop
      console.log(error);
      exportResults(details);
      return false;
    }
  }
};
It took a little over 5 minutes to get 150 results. Just loading the URLs takes a long time. I have seen other scrapers that are very fast.
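For reference, the loop in scrapeDetailsPage awaits each request before starting the next, so the total time is roughly the sum of all request times. A minimal sketch of the alternative, fetching in small concurrent batches with Promise.all, might look like the following. The names mapInBatches and fakeFetch are hypothetical and not part of the scraper above; fakeFetch stands in for the real axios.get plus cheerio parsing, and the batch size of 2 is arbitrary.

```javascript
// Hypothetical helper (not part of the scraper above): runs `worker` on the
// items in batches of `batchSize`, awaiting each batch with Promise.all.
const mapInBatches = async (items, batchSize, worker) => {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Calls within a batch run concurrently; batches run one after another
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
};

// Stand-in for the real per-listing request (assumption: the real worker
// would call axios.get on the URL and parse the response with cheerio).
const fakeFetch = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(`details for ${url}`), 10));

const demo = async () => {
  const urls = ['/biz/a', '/biz/b', '/biz/c', '/biz/d', '/biz/e'];
  const out = await mapInBatches(urls, 2, fakeFetch);
  console.log(`fetched ${out.length} pages`);
};

demo();
```

Results come back in the same order as the input. Whether this actually helps depends on how the site responds to concurrent requests, so the batch size would need to be kept modest.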