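Hello, I am new to web scraping with Puppeteer, and I am currently facing the following problem:
On the site I am trying to extract information from, there is a Bootstrap table with typical JS pagination, like this example: https://getbootstrap.com/docs/4.1/components/pagination/
When I inspect the page's HTML with the Chrome inspector, I can only see 2 , and when I check where the link points, I can see
How can I find out how many pages there are in total, and how do I click through them? I don't understand how to reach each page of this kind of pagination.
Thanks!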
Answer 0 (score: 0)
There is no foolproof method, but this is the order in which I handle pagination.
Target HTML code:
<!-- Copied from: https://jsfiddle.net/solodev/yw7y4wez -->
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Pagination Example</title>
<meta name="robots" content="noindex, nofollow">
<meta name="googlebot" content="noindex, nofollow">
<meta name="viewport" content="width=device-width, initial-scale=1">
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script type="text/javascript" src="https://www.solodev.com/assets/pagination/jquery.twbsPagination.js"></script>
<style type="text/css">
.container {
margin-top: 20px;
}
.page {
display: none;
}
.page-active {
display: block;
}
</style>
<script type="text/javascript">
window.onload = function() {
$('#pagination-demo').twbsPagination({
totalPages: 5,
// the current page that show on start
startPage: 1,
// maximum visible pages
visiblePages: 5,
initiateStartPageClick: true,
// template for pagination links
href: false,
// variable name in href template for page number
hrefVariable: '{{number}}',
// Text labels
first: 'First',
prev: 'Previous',
next: 'Next',
last: 'Last',
// carousel-style pagination
loop: false,
// callback function
onPageClick: function(event, page) {
$('.page-active').removeClass('page-active');
$('#page' + page).addClass('page-active');
},
// pagination Classes
paginationClass: 'pagination',
nextClass: 'next',
prevClass: 'prev',
lastClass: 'last',
firstClass: 'first',
pageClass: 'page',
activeClass: 'active',
disabledClass: 'disabled'
});
}
</script>
</head>
<body>
<div class="container">
<div class="jumbotron page" id="page1">
<div class="container">
<h1 class="display-3">Adding Pagination to your Website</h1>
<p class="lead">In this article we teach you how to add pagination, an excellent way to navigate large amounts of content, to your website using a jQuery Bootstrap Plugin.</p>
<p><a class="btn btn-lg btn-success" href="https://www.solodev.com/blog/web-design/adding-pagination-to-your-website.stml" role="button">Learn More</a></p>
</div>
</div>
<div class="jumbotron page" id="page2">
<h1 class="display-3">Not Another Jumbotron</h1>
<p class="lead">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.</p>
<p><a class="btn btn-lg btn-success" href="#" role="button">Sign up today</a></p>
</div>
<div class="jumbotron page" id="page3">
<h1 class="display-3">Data. Data. Data.</h1>
<p>This example is a quick exercise to illustrate how the default responsive navbar works. It's placed within a <code>.container</code> to limit its width and will scroll with the rest of the page's content.
</p>
<p>
<a class="btn btn-lg btn-primary" href="../../components/navbar" role="button">View navbar docs »</a>
</p>
</div>
<div class="jumbotron page" id="page4">
<h1 style="-webkit-user-select: auto;">Buy Now!</h1>
<p class="lead" style="-webkit-user-select: auto;">Cras justo odio, dapibus ac facilisis in, egestas eget quam. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet.</p>
<p style="-webkit-user-select: auto;"><a class="btn btn-lg btn-success" href="#" role="button" style="-webkit-user-select: auto;">Get
started today</a></p>
</div>
<div class="jumbotron page" id="page5">
<h1 class="cover-heading">Cover your page.</h1>
<p class="lead">Cover is a one-page template for building simple and beautiful home pages. Download, edit the text, and add your own fullscreen background photo to make it your own.</p>
<p class="lead">
<a href="#" class="btn btn-lg btn-primary">Learn more</a>
</p>
</div>
<ul id="pagination-demo" class="pagination-lg pull-right"></ul>
</div>
<script>
// tell the embed parent frame the height of the content
if (window.parent && window.parent.parent) {
window.parent.parent.postMessage(["resultsFrame", {
height: document.body.getBoundingClientRect().height,
slug: "yw7y4wez"
}], "*")
}
</script>
</body>
</html>
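For markup like the above, one way to estimate the total number of pages is to read the labels of the rendered pagination links and take the largest numeric one. The sketch below is only an illustration and not part of the original answer: `totalPagesFromLabels` and `countPages` are hypothetical helpers, and the `#pagination-demo li` selector assumes the list generated by twbsPagination in this example. It only sees the links that are currently rendered, so when `visiblePages` is smaller than `totalPages` you would need to click `Last` first and read the labels again.

```javascript
// Hypothetical helper: given the text of every pagination link
// (e.g. ['First', 'Previous', '1', '2', '3', 'Next', 'Last']),
// return the largest page number, or 0 if no label is numeric.
function totalPagesFromLabels(labels) {
  const numbers = labels
    .map((label) => label.trim())
    .filter((label) => /^\d+$/.test(label))
    .map(Number);
  return numbers.length > 0 ? Math.max(...numbers) : 0;
}

// Reads the labels from a Puppeteer `page` that has already loaded the document.
// Assumption: the links live in `#pagination-demo li`, as in the HTML above.
async function countPages(page) {
  const labels = await page.$$eval('#pagination-demo li', (items) =>
    items.map((item) => item.textContent)
  );
  return totalPagesFromLabels(labels);
}
```

Keeping the parsing in a plain function makes it easy to check without launching a browser; only `countPages` touches Puppeteer.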
Here is a working version of the sample code:
const puppeteer = require('puppeteer');

async function runScraper() {
  let browser;
  let page;
  const url = 'http://localhost:8080';

  // open the browser and load the page
  async function navigate() {
    browser = await puppeteer.launch({ headless: false });
    page = await browser.newPage();
    await page.goto(url);
  }

  async function scrapeData() {
    const headerSel = 'h1';
    // wait for the element (page.waitFor is deprecated in newer Puppeteer versions)
    await page.waitForSelector(headerSel);
    return page.evaluate((selector) => {
      const target = document.querySelector(selector);
      // get the data
      const text = target.innerText;
      // remove the element so the next call picks up the next page's header
      target.remove();
      return text;
    }, headerSel);
  }

  // this is a sample concept of pagination
  // it will vary from site to site because not all sites use the same type of pagination
  async function paginate() {
    // manually check whether the next button is disabled
    const nextBtnDisabled = !!(await page.$('.next.disabled'));
    if (!nextBtnDisabled) {
      // since it is not disabled, click it
      await page.evaluate(() => document.querySelector('.next').click());
      // short fixed pause so the page can update
      await new Promise((resolve) => setTimeout(resolve, 100));
      return true;
    }
    console.log({ nextBtnDisabled });
    return false;
  }

  /**
   * Scraping Logic
   */
  await navigate();
  // Scrape 5 pages
  for (const pageNum of [...Array(5).keys()]) {
    const title = await scrapeData();
    console.log(pageNum + 1, title);
    await paginate();
  }
  await browser.close();
}

runScraper().catch(console.error);
Result:
Server running at 8080
1 'Adding Pagination to your Website'
2 'Not Another Jumbotron'
3 'Data. Data. Data.'
4 'Buy Now!'
5 'Cover your page.'
{ nextBtnDisabled: true }
I have not shared the server code; it basically just serves the HTML snippet above.
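The loop in the scraper hard-codes five pages. When the page count is not known in advance, the same idea can be driven by the state of the Next button instead. This is a minimal sketch, assuming the `.next` and `.next.disabled` classes from the twbsPagination example; `scrapeUntilLastPage`, `scrapeOne`, and `maxPages` are hypothetical names introduced here for illustration.

```javascript
// `page` is a Puppeteer Page that has already navigated to the target URL.
// `scrapeOne` is any async function that extracts data from the current page.
// `maxPages` is a safety cap so a wrong selector cannot loop forever.
async function scrapeUntilLastPage(page, scrapeOne, maxPages = 100) {
  const results = [];
  for (let i = 0; i < maxPages; i++) {
    results.push(await scrapeOne(page));
    // page.$ resolves to null when nothing matches the selector
    const onLastPage = (await page.$('.next.disabled')) !== null;
    if (onLastPage) break;
    // assumption: clicking the `.next` item advances the pagination
    await page.click('.next');
  }
  return results;
}
```

Inside a scraper like the one above, this could be called as `const titles = await scrapeUntilLastPage(page, scrapeData);` once the page has loaded. On sites that fetch each page over the network, a wait after the click (for example `page.waitForSelector` on an element that only appears in the new page) would likely be needed.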
Answer 1 (score: 0)
Use the footerTemplate option together with displayHeaderFooter to show page numbers natively through the Puppeteer API:
await page.pdf({
  path: 'hacks.pdf',
  format: 'A4',
  displayHeaderFooter: true,
  // double quotes inside the single-quoted string, otherwise this is a syntax error
  footerTemplate: '<div><div class="pageNumber"></div> <div>/</div><div class="totalPages"></div></div>'
});
https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagepdfoptions
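footerTemplate: HTML template for the print footer. It should be valid HTML markup, with the following CSS classes used to inject printing values into it:
- date: formatted print date
- title: document title
- url: document location
- pageNumber: current page number
- totalPages: total pages in the document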
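To illustrate the classes listed above, the footer markup can be built by a small helper and passed to `page.pdf`. This is a sketch under assumptions rather than a definitive recipe: `buildFooterTemplate` and `savePdfWithPageNumbers` are hypothetical helpers, and the explicit `font-size`, the empty `headerTemplate`, and the margins are included on the assumption that the footer may otherwise render too small or be cut off, which is worth verifying against your Puppeteer version.

```javascript
// Hypothetical helper: builds a footer that renders as e.g. "3 / 12".
// The elements carrying the pageNumber and totalPages classes are filled in at print time.
function buildFooterTemplate(fontSizePx = 10) {
  return (
    `<div style="font-size: ${fontSizePx}px; width: 100%; text-align: center;">` +
    '<span class="pageNumber"></span> / <span class="totalPages"></span>' +
    '</div>'
  );
}

// Writes the current document to a PDF with the footer above.
// `page` is a Puppeteer Page; `outputPath` is where the file is saved.
async function savePdfWithPageNumbers(page, outputPath) {
  await page.pdf({
    path: outputPath,
    format: 'A4',
    displayHeaderFooter: true,
    headerTemplate: '<span></span>', // empty header instead of the default one
    footerTemplate: buildFooterTemplate(),
    margin: { top: '40px', bottom: '40px' }, // leave room for the footer
  });
}
```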