Question

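As the question title says, I am using curl for web scraping, and the response contains all the HTML elements but without proper indentation.

curl somewebsite.com/somepage > scrape.html

After running this command, the data is saved to scrape.html (or scrape.txt, whichever file name is given). The content looks very messy, and most of it is on a single line.

The file content looks like this: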

<!DOCTYPE html><html lang="en"><head><script src="/cdn-cgi/apps/head/a2ff1ftsK3yTu21p1BeEN2BZsnA.js"></script><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&amp;family=DM+Sans:wght@400&amp;display=swap" rel="stylesheet" media="print" onload="if(!window._isAppPrerendering)this.removeAttribute(&quot;media&quot;);"><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&amp;family=DM+Sans:wght@400&amp;display=swap" rel="preload" as="style"><link href="https://fonts.gstatic.com" rel="preconnect" crossorigin="true"><meta charset="utf-8">

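As you can see above, it is all on a single line.

Is there an option in curl, or any other simple way, to get the scraped page output with indentation?

I am fine with a solution in PHP, JavaScript, or Node.js.

Thanks in advance.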

Answer 1

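Nobody answered the question, so I could not find a solution that way.

My workaround is to use a beautifier tool, such as

https://beautifytools.com/html-beautifier.php#

This tool actually works for large websites with lots of scripts and styles.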

Answer 2

Pipe the curl output through PHP's DOMDocument, which re-serializes the markup with indentation:

curl somewebsite.com/somepage | php -r '$d = new DOMDocument(); $d->preserveWhiteSpace = false; $d->formatOutput = true; @$d->loadHTML(stream_get_contents(STDIN), LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS); echo $d->saveXML();' > scrape.html
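
Since the question also allows Node.js, here is a rough sketch of the same idea without any dependencies. It is not a real HTML parser: it splits the markup into tags and text runs with a regular expression and indents by nesting depth, so it will likely misformat pages where a `>` appears inside an attribute value, a comment, or inline script. The names `indentHtml`, `VOID_TAGS`, `indent.js`, and `raw.html` are made up for this example.

```javascript
// indent.js: naive HTML indenter (a sketch, not a full HTML parser).
const fs = require("fs");

// Void elements never have a closing tag, so they must not increase depth.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

function indentHtml(html, indent = "  ") {
  // Split the input into tags (<...>) and the text runs between them.
  const tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  const lines = [];
  let depth = 0;

  for (const token of tokens) {
    const text = token.trim();
    if (text === "") continue; // skip whitespace-only runs between tags

    if (text.startsWith("</")) {
      // Closing tag: step out one level before printing it.
      depth = Math.max(depth - 1, 0);
      lines.push(indent.repeat(depth) + text);
    } else if (text.startsWith("<")) {
      lines.push(indent.repeat(depth) + text);
      // Doctype and comments start with "<!", so no tag name matches for them.
      const name = (text.match(/^<([a-zA-Z0-9-]+)/) || [])[1];
      const selfClosing = text.endsWith("/>");
      // Only non-void, non-self-closing opening tags step in one level.
      if (name && !selfClosing && !VOID_TAGS.has(name.toLowerCase())) {
        depth += 1;
      }
    } else {
      // Plain text between tags, printed at the current depth.
      lines.push(indent.repeat(depth) + text);
    }
  }
  return lines.join("\n");
}

// Command-line use: node indent.js raw.html > scrape.html
if (process.argv[2]) {
  console.log(indentHtml(fs.readFileSync(process.argv[2], "utf8")));
}
```

Assuming the script is saved as indent.js, it could be used as `curl somewebsite.com/somepage > raw.html && node indent.js raw.html > scrape.html`. Putting every text run on its own line can change how whitespace renders in a browser, so the output is better suited to reading than to serving.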

How to scrape a website using curl and get properly indented HTML elements
