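I'm new to Puppeteer and don't know its full potential. I have the following code, which returns the scraped result, but each row comes back as one long tab-delimited string. I'm trying to get proper JSON.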
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  let data = await page.evaluate(() => {
    const table = Array.from(document.querySelectorAll('table[id="gvM"] > tbody > tr'));
    return table.map(td => td.innerText);
  });
  console.log(data);
})();
Here is the HTML table:
<table cellspacing="0" cellpadding="4" rules="all" border="1" id="gvM" >
<tr >
<th scope="col">#</th><th scope="col">Resource</th><th scope="col">EM #</th><th scope="col">CVO</th><th scope="col">Start</th><th scope="col">End</th><th scope="col">Status</th><th scope="col">Assignment</th><th scope="col"> </th>
</tr>
<tr >
<td>31</td><td>Smith</td><td>618</td><td align="center"><span class="aspNetDisabled"><input id="gvM_ctl00_0" type="checkbox" name="gvM$ctl02$ctl00" disabled="disabled" /></span></td><td> </td><td> </td><td>AVAILABLE EXEC</td><td style="width:800px;">6F</td><td align="center"></td>
</tr>
<tr style="background-color:LightGreen;">
<td>1</td><td>John</td><td>604</td><td align="center"><span class="aspNetDisabled"></span></td><td>1400</td><td>2200</td><td>AVAILABLE</td><td style="width:800px;"> </td><td align="center"></td>
</tr>
</table>
This is what I get:
[ '#\tResource\tEM #\tCVO\tStart\tEnd\tStatus\tAssignment\t ',
'31\tSmith\t618\t\t \t \tAVAILABLE EXEC\t6F\t',
'1\tJohn\t604\t\t1400\t2200\tAVAILABLE\t \t']
This is what I want to get:
[['#', 'Resource', 'EM #', 'CVO', 'Start', 'End', 'Status', 'Assignment', ' '],
 ['31', 'Smith', '618', '', ' ', ' ', 'AVAILABLE EXEC', '6F', ''],
 ['1', 'John', '604', '', '1400', '2200', 'AVAILABLE', ' ', '']]
I applied the answer below but could not reproduce the result. Maybe I did something wrong. Can you explain what I got wrong?
const context = document.querySelectorAll('table[id="gvM"] > tbody > tr ');
const query = (selector, context) => Array.from(context.querySelectorAll(selector));
console.log(
  query('tr', context).map(row =>
    query('td, th', row).map(cell =>
      cell.textContent))
);
What does this error mean?
(node:6204) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with. .catch(). (rejection id: 1)
(node:6204) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
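Answer 0 (score: 0)
I don't think this is related to Puppeteer, but rather to the way you "iterate" over the <table>.
In your attempt, you are simply dumping the text content of each whole row, which produces the result you are seeing. Instead, for each <tr> you need to fetch all of its <td> (or <th>) elements: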
const query = (selector, context) =>
  Array.from(context.querySelectorAll(selector));
console.log(
  query('tr', document).map(row =>
    query('td, th', row).map(cell =>
      cell.textContent))
);
<table>
<tr>
<th>col 1</th>
<th>col 2</th>
<th>col 3</th>
</tr>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
</tr>
<tr>
<td>x</td>
<td>y</td>
<td>z</td>
</tr>
</table>
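Regarding the follow-up in the question: the attempt there most likely fails for two reasons. First, `context` is the NodeList returned by `document.querySelectorAll(...)`, and a NodeList has no `querySelectorAll` method of its own, so the helper throws a TypeError. Second, `document` only exists in the browser, so the snippet has to run inside `page.evaluate`. An error thrown inside an async function with no `catch` is what Node reports as an unhandled promise rejection. Below is a minimal sketch of one way to wire this together; `scrapeRows` is a hypothetical helper name (not part of Puppeteer), and it assumes `page` is a Puppeteer page that has already navigated to the target URL.

```javascript
// Hypothetical helper, not a Puppeteer API.
// Assumes `page` is a Puppeteer Page that has already loaded the target URL.
async function scrapeRows(page, rowSelector) {
  // The callback is executed in the browser, where `document` exists.
  // It cannot see Node-side variables, so the selector is passed as an argument.
  return page.evaluate(selector => {
    const query = (sel, context) => Array.from(context.querySelectorAll(sel));
    // `document` is a single node, so querySelectorAll is available on it.
    // The NodeList that querySelectorAll returns does not have that method.
    return query(selector, document).map(row =>
      query('td, th', row).map(cell => cell.textContent));
  }, rowSelector);
}

// Usage sketch: attach .catch() so a failure is logged instead of
// surfacing as an unhandled promise rejection.
// scrapeRows(page, 'table[id="gvM"] > tbody > tr')
//   .then(rows => console.log(rows))
//   .catch(err => console.error('scrape failed:', err));
```

Since the selector already matches the `<tr>` elements, the inner code only has to look up the cells of each row.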
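Answer 1 (score: 0)
If you need an array of arrays from the table, you can try this approach: map all rows to an array of rows, and map all cells inside each row element to an array of cells (this variant uses Array.from() with a mapping function as the second argument):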
const data = await page.evaluate(
  () => Array.from(
    document.querySelectorAll('table[id="gvM"] > tbody > tr'),
    row => Array.from(row.querySelectorAll('th, td'), cell => cell.innerText)
  )
);
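If "proper JSON" means one object per row keyed by the column headers, the array of arrays can be post-processed on the Node side. This is only a sketch: `rowsToObjects` is a hypothetical helper name, it assumes the first row is the header row, and the sample data is the question's output split into cells.

```javascript
// Hypothetical helper: turns [headerRow, ...dataRows] into an array of objects.
// Assumes the first row holds the column names.
function rowsToObjects(rows) {
  if (rows.length === 0) return [];
  const [header, ...body] = rows;
  const keys = header.map(name => name.trim());
  return body.map(cells => {
    const record = {};
    keys.forEach((key, i) => {
      // Skip unnamed columns (the sample table has an empty trailing header).
      if (key !== '') record[key] = (cells[i] ?? '').trim();
    });
    return record;
  });
}

// Sample input: the rows from the question, already split into cells.
const sample = [
  ['#', 'Resource', 'EM #', 'CVO', 'Start', 'End', 'Status', 'Assignment', ' '],
  ['31', 'Smith', '618', '', ' ', ' ', 'AVAILABLE EXEC', '6F', ''],
  ['1', 'John', '604', '', '1400', '2200', 'AVAILABLE', ' ', ''],
];

console.log(JSON.stringify(rowsToObjects(sample), null, 2));
```

Trimming the cell text is a design choice here, so that cells containing only a space come out as empty strings; drop the `.trim()` calls if the original whitespace matters.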