Question

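I guess this shows my inexperience, but how can I get the HTML presentation of a website? For example, I'm trying to retrieve the HTML structure of a Wix site (what the user actually sees on screen), but instead I get a lot of the scripts that exist on the site. I'm doing a small code test for scraping. Many thanks.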

Answer 1

Okay, here we go. Sorry for the delay.

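I used selenium to load the page, so I can be sure to capture all of the markup even if it is loaded by ajax. Make sure you grab the standalone library; that one tripped me up.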

Once the html is retrieved, I pass it to jsoup, which I use to walk the document and remove all of the text.

Here is the example code:

// selenium to grab the html
// i chose to use this to get anything that may be loaded by ajax
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// jsoup for parsing the html
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;


public class Example {
    public static void main(String[] args) {
        // Create a new instance of the Firefox driver
        // Notice that the remainder of the code relies on the interface,
        // not the implementation.
        WebDriver driver = new FirefoxDriver();

        try {
            // And now use this to visit stackoverflow
            driver.get("http://stackoverflow.com/");

            // Get the page source as the browser currently holds it
            String html = driver.getPageSource();

            // Parse with jsoup's XML parser so the markup is kept as-is
            // rather than normalized into html/head/body
            Document doc = Jsoup.parse(html, "", Parser.xmlParser());

            // Remove the text nodes of every element that has text of its own
            for (Element el : doc.select("*")) {
                if (!el.ownText().isEmpty()) {
                    for (TextNode node : el.textNodes()) {
                        node.remove();
                    }
                }
            }

            // Print what is left: the bare tag structure
            System.out.println(doc);
        } finally {
            // Close the browser even if something above throws
            driver.quit();
        }
    }
}

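I'm not sure whether you wanted to get rid of the attributes as well; currently they are left in place. It would be easy to modify the code so that some or all of the attributes are removed too.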

Answer 2

If you only need the content of the page, you can use ?_escaped_fragment_ on each URL to get the static content.

_escaped_fragment_ is a standard method used in Ajax crawling to crawl pages that are dynamic in nature, or that are generated/rendered on the client side.

Wix-based sites support _escaped_fragment_.
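
To make that concrete, here is a rough sketch in plain Java (11+, using java.net.http) of how such a request could be built. The class name EscapedFragmentExample and the helpers toEscapedFragmentUrl and fetchStaticHtml are made up for illustration, and http://example.com/ is only a placeholder. I am assuming the parameter is sent with an empty value (_escaped_fragment_=) and that the URL has no # fragment; whether a given site still answers such requests is something you would have to try.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, not part of any library.
class EscapedFragmentExample {

    // Append the _escaped_fragment_ parameter to a page URL.
    // Assumption: the URL contains no '#' fragment. An existing query
    // string is kept and the parameter is appended with '&'.
    static String toEscapedFragmentUrl(String pageUrl) {
        String separator = pageUrl.contains("?") ? "&" : "?";
        return pageUrl + separator + "_escaped_fragment_=";
    }

    // Fetch the rewritten URL and return the response body as a string.
    static String fetchStaticHtml(String pageUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(toEscapedFragmentUrl(pageUrl)))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No URL given: only show what the rewritten URL looks like.
            System.out.println(toEscapedFragmentUrl("http://example.com/"));
        } else {
            // URL given: fetch the static snapshot and print it.
            System.out.println(fetchStaticHtml(args[0]));
        }
    }
}
```

The returned string is ordinary HTML, so it can be handed to a parser such as jsoup in the same way as the page source in Answer 1, without starting a browser.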
