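Android - Parsing a JS-generated URL with JSOUP

Time: 2016-08-25 08:20:41

Tags: javascript java android web-scraping jsoup

I am trying to parse a URL generated by Bootstrap's Bootpage.js, https://example.com/#page-2, but JSOUP cannot parse it and shows the main URL instead. How can I get a normal link from Bootpage, or how can I make JSOUP parse it?

Parsing code: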

Jsoup.connect("https://example.com/#page-2").followRedirects(true).get();

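1 answer:

Answer 0: (score: 5)

See the update below. The first/accepted solution does not meet the Android requirement, but it is kept for reference.

Desktop solution

HtmlUnit does not seem to be able to handle this site (as is often the case lately). So I do not have a plain Java solution either, but you can use PhantomJS: download the binary for your OS, create a script file, start the process from within your Java code, and parse the output with a DOM parser such as jsoup.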
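For background on why the plain Jsoup.connect call returns the main page: the #page-2 part of the URL is a fragment, which is interpreted on the client side (here by the page's JavaScript) and is not sent to the server as part of the HTTP request. The minimal sketch below only illustrates that split using java.net.URI. The class FragmentDemo and its helper methods are hypothetical names chosen for illustration, not part of jsoup or of the original answer.

```java
import java.net.URI;

// Hypothetical demo class; names are illustrative assumptions.
class FragmentDemo {

    // Returns the part after '#', or null if the URL has no fragment.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    // Returns the URL with the fragment cut off, which is roughly
    // what an HTTP client actually requests from the server.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String url = "https://example.com/#page-2";
        System.out.println("fragment:  " + fragmentOf(url));
        System.out.println("requested: " + withoutFragment(url));
    }
}
```

Because the server sees the same request for every page number, the paginated content has to be produced by running the page's JavaScript, which is what the headless browser below is for.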

Script file (here called simple.js):

var page = require('webpage').create();
var fs = require('fs');
var system = require('system');

var url = "";
var fileName = "output";
// first parameter: url
// second parameter: filename for output
console.log("args length: " + system.args.length);

if (system.args.length > 1) {
    url=system.args[1];
}
if (system.args.length > 2){
    fileName=system.args[2];
}
if(url===""){
    phantom.exit();
}

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.settings.loadImages = false; 

page.open(url, function(status) {
    console.log("Status: " + status);
    if(status === "success") {
        var path = fileName+'.html';
        fs.write(path, page.content, 'w');
    }
    phantom.exit();
});

Java code (example that gets the title and cover URL):

try {
    //change path to phantomjs binary and your script file
    String outputFileName = "srulad";
    String phantomJSPath = "phantomjs" + File.separator + "bin" + File.separator + "phantomjs";
    String scriptFile = "simple.js";

    String urlParameter = "http://srulad.com/#page-2";

    new File(outputFileName+".html").delete();

    Process process = Runtime.getRuntime().exec(phantomJSPath + " " + scriptFile + " " + urlParameter + " " + outputFileName);
    process.waitFor();

    Document doc = Jsoup.parse(new File(outputFileName + ".html"),"UTF-8"); // <outputFileName>.html is written by simple.js into the working directory
    Elements elements = doc.select("#list_page-2 > div");

    for (Element element : elements) {
        System.out.println(element.select("div.l-description.float-left > div:nth-child(1) > a").first().attr("title"));
        System.out.println(element.select("div.l-image.float-left > a > img.lazy").first().attr("data-original"));
    }
} catch (IOException | InterruptedException e) {
    e.printStackTrace();
}

Output:

სიყვარული და მოწყალება / Love & Mercy
http://srulad.com/assets/uploads/42410_Love_and_Mercy.jpg
მუზა / The Muse
http://srulad.com/assets/uploads/43164_large_qRzsimNz0eDyFLFJcbVLIxlqii.jpg
...
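As a possible variation on the launch step above, the command can be passed as a list of arguments via ProcessBuilder instead of one concatenated string, so that paths or URLs containing spaces are not split apart. This is a minimal sketch under that assumption. The class PhantomLauncher and its methods are hypothetical names, and the PhantomJS path is the same placeholder used in the code above.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; class and method names are illustrative assumptions.
class PhantomLauncher {

    // Builds the argument list: binary, script, url, output file name.
    static List<String> buildCommand(String phantomJsPath, String scriptFile,
                                     String url, String outputFileName) {
        return Arrays.asList(phantomJsPath, scriptFile, url, outputFileName);
    }

    // Starts the process, forwards its console output, and waits for it to exit.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // show the script's console.log output in our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        String phantomJsPath = "phantomjs" + File.separator + "bin" + File.separator + "phantomjs";
        List<String> command = buildCommand(phantomJsPath, "simple.js",
                "http://srulad.com/#page-2", "srulad");
        System.out.println(String.join(" ", command));

        // Only launch if the binary is actually present at the assumed path.
        if (new File(phantomJsPath).canExecute()) {
            System.out.println("exit code: " + run(command));
        }
    }
}
```

The resulting srulad.html can then be parsed with jsoup exactly as in the code above.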

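Update

With a WebView and jsoup it is possible to parse websites with JavaScript-based dynamic content on Android. The following example app uses a JavaScript-enabled WebView to render a JavaScript-dependent website. A JavascriptInterface returns the HTML source, which is then parsed with jsoup. As a proof of concept, the titles and cover image URLs are used to populate a ListView. The buttons decrement or increment the page number, triggering an update of the ListView. Note: tested on an Android 5.1.1 / API 22 device.

Add the internet permission to your AndroidManifest.xml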

<uses-permission android:name="android.permission.INTERNET" />

activity_main.xml

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:orientation="vertical"
    android:layout_width="match_parent"
    android:layout_height="match_parent">

    <LinearLayout
        android:orientation="horizontal"
        android:layout_width="match_parent"
        android:layout_height="wrap_content">

        <Button
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:text="@string/page_down"
            android:id="@+id/buttonDown"
            android:layout_weight="0.5" />

        <Button
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:text="@string/page_up"
            android:id="@+id/buttonUp"
            android:layout_weight="0.5" />
    </LinearLayout>

    <ListView
        android:layout_width="match_parent"
        android:layout_height="0dp"
        android:id="@+id/listView"
        android:layout_gravity="bottom"
        android:layout_weight="0.5" />
</LinearLayout>

MainActivity.java

public class MainActivity extends AppCompatActivity {

    private final Handler uiHandler = new Handler();
    private ArrayAdapter<String> adapter;
    private ArrayList<String> entries = new ArrayList<>();
    private ProgressDialog progressDialog;

    private class JSHtmlInterface {
        @android.webkit.JavascriptInterface
        public void showHTML(String html) {
            final String htmlContent = html;

            uiHandler.post(
                new Runnable() {
                    @Override
                    public void run() {
                        Document doc = Jsoup.parse(htmlContent);
                        Elements elements = doc.select("#online_movies > div > div");
                        entries.clear();
                        for (Element element : elements) {
                            String title = element.select("div.l-description.float-left > div:nth-child(1) > a").first().attr("title");
                            String imgUrl = element.select("div.l-image.float-left > a > img.lazy").first().attr("data-original");
                            entries.add(title + "\n" + imgUrl);
                        }
                        adapter.notifyDataSetChanged();
                    }
                }
            );
        }
    }


    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        ListView listView = (ListView) findViewById(R.id.listView);
        adapter = new ArrayAdapter<>(this, android.R.layout.simple_list_item_1, android.R.id.text1, entries);
        listView.setAdapter(adapter);

        progressDialog = ProgressDialog.show(this, "Loading","Please wait...", true);
        progressDialog.setCancelable(false);

        try {
            final WebView browser = new WebView(this);
            browser.setVisibility(View.INVISIBLE);
            browser.setLayerType(View.LAYER_TYPE_NONE,null);
            browser.getSettings().setJavaScriptEnabled(true);
            browser.getSettings().setBlockNetworkImage(true);
            browser.getSettings().setDomStorageEnabled(false);
            browser.getSettings().setCacheMode(WebSettings.LOAD_NO_CACHE);
            browser.getSettings().setLoadsImagesAutomatically(false);
            browser.getSettings().setGeolocationEnabled(false);
            browser.getSettings().setSupportZoom(false);

            browser.addJavascriptInterface(new JSHtmlInterface(), "JSBridge");

            browser.setWebViewClient(
                new WebViewClient() {

                    @Override
                    public void onPageStarted(WebView view, String url, Bitmap favicon) {
                        progressDialog.show();
                        super.onPageStarted(view, url, favicon);
                    }

                    @Override
                    public void onPageFinished(WebView view, String url) {
                        browser.loadUrl("javascript:window.JSBridge.showHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
                        progressDialog.dismiss();
                    }
                }
            );

            findViewById(R.id.buttonDown).setOnClickListener(new View.OnClickListener() {
                @Override
                public void onClick(View view) {
                    uiHandler.post(new Runnable() {
                        @Override
                        public void run() {
                            int page = Integer.parseInt(browser.getUrl().split("-")[1]);
                            int newPage = page > 1 ? page-1 : 1;
                            browser.loadUrl("http://srulad.com/#page-" + newPage);
                            browser.loadUrl(browser.getUrl()); // not sure why this is needed, but doesn't update without it on my device
                            if(getSupportActionBar()!=null) getSupportActionBar().setTitle(browser.getUrl());
                        }
                    });
                }
            });

            findViewById(R.id.buttonUp).setOnClickListener(new View.OnClickListener() {
                @Override
                public void onClick(View view) {
                    uiHandler.post(new Runnable() {
                        @Override
                        public void run() {
                            int page = Integer.parseInt(browser.getUrl().split("-")[1]);
                            int newPage = page+1;
                            browser.loadUrl("http://srulad.com/#page-" + newPage);
                            browser.loadUrl(browser.getUrl()); // not sure why this is needed, but doesn't update without it on my device
                            if(getSupportActionBar()!=null) getSupportActionBar().setTitle(browser.getUrl());
                        }
                    });
                }
            });

            browser.loadUrl("http://srulad.com/#page-1");
            if(getSupportActionBar()!=null) getSupportActionBar().setTitle(browser.getUrl());

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
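The button handlers above read the page number with getUrl().split("-")[1], which would pick the wrong piece if the host name itself contained a hyphen, and WebView.getUrl() may return null. A more defensive variant could read the number from the URL fragment instead. The sketch below is plain Java and only an illustration: the class PageUrls and its methods are hypothetical names, and falling back to page 1 on any unparsable input is an assumption, not behavior taken from the original code.

```java
import java.net.URI;

// Hypothetical helper; names and the fallback-to-1 policy are assumptions.
class PageUrls {

    static final String BASE = "http://srulad.com/#page-";
    private static final String PREFIX = "page-";

    // Extracts N from a URL ending in "#page-N"; returns 1 if it cannot be read.
    static int pageOf(String url) {
        if (url == null) {
            return 1;
        }
        try {
            String fragment = URI.create(url).getFragment();
            if (fragment == null || !fragment.startsWith(PREFIX)) {
                return 1;
            }
            return Math.max(1, Integer.parseInt(fragment.substring(PREFIX.length())));
        } catch (IllegalArgumentException e) {
            // Covers malformed URLs and NumberFormatException alike.
            return 1;
        }
    }

    // URL of the previous page, never going below page 1.
    static String previous(String url) {
        return BASE + Math.max(1, pageOf(url) - 1);
    }

    // URL of the next page.
    static String next(String url) {
        return BASE + (pageOf(url) + 1);
    }

    public static void main(String[] args) {
        String current = "http://srulad.com/#page-2";
        System.out.println(previous(current));
        System.out.println(next(current));
    }
}
```

In the activity, the click handlers could then call browser.loadUrl(PageUrls.next(browser.getUrl())) or the previous(...) counterpart.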