Unable to scrape an HTML website?

Date: 2012-10-09 02:42:15

Tags: java android eclipse parsing jsoup

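I'm trying to have my app go to a website, fetch its HTML, strip out the unnecessary elements, and then load the 'content' in my makeshift app, since I don't have an API or feed to work with. I'm using Jsoup, and it works fine when I do the web scraping outside of Android, but Android doesn't like it.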

public class SimpleDiggActivity extends Activity {

private WebView browser;
final Activity activity = this;

@Override
public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    this.getWindow().requestFeature(Window.FEATURE_PROGRESS);

    setContentView(R.layout.main);

    getWindow().setFeatureInt(Window.FEATURE_PROGRESS, Window.PROGRESS_VISIBILITY_ON);

    String url = "http://www.digg.com";
    Document digg;
    browser = (WebView) findViewById(R.id.mybrowser);
    final Button homeDigg = (Button) findViewById(R.id.button1);

    browser.setWebViewClient(new SimpleWebViewClient());

    browser.getSettings().setJavaScriptEnabled(true);
    browser.getSettings().setUseWideViewPort(true);
    browser.getSettings().setLoadWithOverviewMode(true);
    browser.getSettings().setDisplayZoomControls(false);
    browser.getSettings().setEnableSmoothTransition(true);
    browser.getSettings().setBuiltInZoomControls(true);
    browser.getSettings().setUserAgentString("Android");

    // progressCircle = ProgressDialog.show(SimpleDiggActivity.this, "", "Loading...");
    final ProgressDialog progressCircle = new ProgressDialog(activity);
    progressCircle.setProgressStyle(ProgressDialog.STYLE_SPINNER);
    progressCircle.setMessage("Loading...");
    progressCircle.setCancelable(false);

    try{
        Toast.makeText(getApplicationContext(), "No Steps down", Toast.LENGTH_SHORT).show();
        Document diggTest = Jsoup.connect("http://digg.com/enable/mobile").get();
        Toast.makeText(getApplicationContext(), "1 Steps down", Toast.LENGTH_SHORT).show();
        String diggTitle = diggTest.title();
        Toast.makeText(getApplicationContext(), "2 Steps down"    , Toast.LENGTH_SHORT).show();
        Document compressed = Jsoup.parseBodyFragment(diggTitle);
        Toast.makeText(getApplicationContext(), "3 Steps down", Toast.LENGTH_SHORT).show();
        org.jsoup.select.Elements div = diggTest.select("div");
        Toast.makeText(getApplicationContext(), "4 Steps down", Toast.LENGTH_SHORT).show();
        String divBrow = div.toString();
        Toast.makeText(getApplicationContext(), "5 Steps down", Toast.LENGTH_SHORT).show();
        browser.loadUrl(divBrow);
    }catch (Exception e){
        e.printStackTrace();

        Toast.makeText(getApplicationContext(), "Gave up", Toast.LENGTH_SHORT).show();
        String diggBrow = url;
        browser.loadUrl("http://www.google.com");
    }
}
}

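Sorry if it's messy; I was just tinkering and this is my first attempt. The Toasts are there to tell me where the code fails inside the try block and falls through to the catch. When I run it, it doesn't get past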

 Document diggTest = Jsoup.connect("http://digg.com/enable/mobile").get();

1 Answer:

Answer 0 (score: 0)

I tried your code with Jsoup version 1.7.1 and it works fine on my end. Here is the working code:

public class SimpleDiggActivity extends Activity {

    final Activity activity = this;

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        this.getWindow().requestFeature(Window.FEATURE_PROGRESS);

        setContentView(R.layout.activity_simple_digg);

        getWindow().setFeatureInt(Window.FEATURE_PROGRESS,
                Window.PROGRESS_VISIBILITY_ON);

        String url = "http://www.digg.com";
        Document digg;

        // progressCircle = ProgressDialog.show(SimpleDiggActivity.this, "",
        // "Loading...");
        final ProgressDialog progressCircle = new ProgressDialog(activity);
        progressCircle.setProgressStyle(ProgressDialog.STYLE_SPINNER);
        progressCircle.setMessage("Loading...");
        progressCircle.setCancelable(false);

        try {
            Toast.makeText(getApplicationContext(), "No Steps down",
                    Toast.LENGTH_SHORT).show();
            Document diggTest = Jsoup.connect("http://digg.com/enable/mobile")
                    .get();
            Toast.makeText(getApplicationContext(), "1 Steps down",
                    Toast.LENGTH_SHORT).show();
            String diggTitle = diggTest.title();
            Toast.makeText(getApplicationContext(), "2 Steps down",
                    Toast.LENGTH_SHORT).show();
            Document compressed = Jsoup.parseBodyFragment(diggTitle);
            Toast.makeText(getApplicationContext(), "3 Steps down",
                    Toast.LENGTH_SHORT).show();
            org.jsoup.select.Elements div = diggTest.select("div");
            Toast.makeText(getApplicationContext(), "4 Steps down",
                    Toast.LENGTH_SHORT).show();
            String divBrow = div.toString();
            Toast.makeText(getApplicationContext(), "5 Steps down",
                    Toast.LENGTH_SHORT).show();
            Log.d(this.getClass().getSimpleName(), "data is " + divBrow);
        } catch (Exception e) {
            e.printStackTrace();

            Toast.makeText(getApplicationContext(), "Gave up",
                    Toast.LENGTH_SHORT).show();
            String diggBrow = url;
        }
    }
}

Here is the value of divBrow:

10-10 11:58:45.631: D/SimpleDiggActivity(350): data is <div class="site-header-container page-container"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):  <header class="site-header"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):   <h1 class="site-header-logo-container"><a href="/" id="site-header-logo" class="image-replace">Digg</a></h1> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):  </header> 
10-10 11:58:45.631: D/SimpleDiggActivity(350): </div>
10-10 11:58:45.631: D/SimpleDiggActivity(350): <div id="container" class="page-container"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):  <ul id="top-stories"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):   <li class="story-container story-1" data-content-id="Racz8K" id="story-Racz8K"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):    <div class="story-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-kicker">
10-10 11:58:45.631: D/SimpleDiggActivity(350):       NO FILTER 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-headline"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="0" class="story-link" href="http://www.fastcompany.com/3001994/no-filter-inside-hipstamatics-lost-year-searching-next-killer-social-app" data-content-id="Racz8K"> Inside Hipstamatic’s Lost Year Searching For The Next Killer Social&nbsp;App </a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-domain"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-link-wrapper"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <a data-position="0" class="story-link" href="http://www.fastcompany.com/3001994/no-filter-inside-hipstamatics-lost-year-searching-next-killer-social-app" data-content-id="Racz8K">fastcompany.com</a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-actions"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <span class="story-action-item story-score"> <span class="story-score-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         <ul class="story-score-details-list"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-thumb-Racz8K story-score-thumb">20</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-tweets-Racz8K story-score-twitter">402</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-fb_shares-Racz8K story-score-facebook">72</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         </ul> </span> <span class="story-score-Racz8K">494</span> </span> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-image"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="0" class="story-link" href="http://www.fastcompany.com/3001994/no-filter-inside-hipstamatics-lost-year-searching-next-killer-social-app" data-content-id="Racz8K"><img src="http://static.digg.com/images/Racz8K_1_www_large_thumb.jpeg" alt="" width="312" height="170" /></a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-preview">
10-10 11:58:45.631: D/SimpleDiggActivity(350):      From rooftop bashes and acquisition talks to staff clashes and layoffs, Hipstamatic’s founders and ex-employees describe the startup’s losing struggle to keep pace with Instagram, Facebook, and others in the white-hot photo-sharing space.
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):    </div> </li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):   <li class="story-container story-1" data-content-id="Qa2sP3" id="story-Qa2sP3"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):    <div class="story-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-kicker">
10-10 11:58:45.631: D/SimpleDiggActivity(350):       PHOTOGRAPHY 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-headline"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="1" class="story-link" href="http://lens.blogs.nytimes.com/2012/10/09/looking-into-the-eyes-of-made-in-china/" data-content-id="Qa2sP3"> Looking Into The Eyes Of 'Made In&nbsp;China' </a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-domain"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-link-wrapper"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <a data-position="1" class="story-link" href="http://lens.blogs.nytimes.com/2012/10/09/looking-into-the-eyes-of-made-in-china/" data-content-id="Qa2sP3">lens.blogs.nytimes.com</a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-actions"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <span class="story-action-item story-score"> <span class="story-score-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         <ul class="story-score-details-list"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-thumb-Qa2sP3 story-score-thumb">0</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-tweets-Qa2sP3 story-score-twitter">252</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-fb_shares-Qa2sP3 story-score-facebook">411</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         </ul> </span> <span class="story-score-Qa2sP3">663</span> </span> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-image"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="1" class="story-link" href="http://lens.blogs.nytimes.com/2012/10/09/looking-into-the-eyes-of-made-in-china/" data-content-id="Qa2sP3"><img src="http://static.digg.com/images/Qa2sP3_1_www_large_thumb.jpeg" alt="" width="312" height="170" /></a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-preview">
10-10 11:58:45.631: D/SimpleDiggActivity(350):      In “Faces of Made in China,” a series of typological portraits looking at workers inside six Chinese factories, the photographer Lucas Schifres seeks to consider the otherwise anonymous people who produce our essential possessions by looking directly into their eyes.
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
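
Note that divBrow holds HTML markup rather than an address, so passing it to browser.loadUrl(...) as the question does would likely not render it. Below is a minimal sketch, assuming the goal is to show the extracted fragment in the WebView: wrap the fragment in a small standalone page and hand that string to WebView.loadDataWithBaseURL. The helper name HtmlPage.wrap is hypothetical and not part of the original code.

```java
/** Hypothetical helper (not in the original post): wraps an HTML fragment in a minimal page. */
public final class HtmlPage {

    private HtmlPage() {}

    /**
     * Builds a complete HTML document around the given fragment.
     * A null title or fragment is treated as empty so the caller always gets a full page.
     */
    public static String wrap(String title, String fragment) {
        String safeTitle = title == null ? "" : escape(title);
        String body = fragment == null ? "" : fragment;
        return "<!DOCTYPE html><html><head><meta charset=\"UTF-8\">"
                + "<title>" + safeTitle + "</title></head>"
                + "<body>" + body + "</body></html>";
    }

    /** Escapes the characters that could break out of the <title> element. */
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

In the activity this could be used as browser.loadDataWithBaseURL("http://digg.com/", HtmlPage.wrap(diggTitle, divBrow), "text/html", "UTF-8", null); the base URL is there so that relative links such as href="/" can resolve.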

Please try it on your end and let me know how it goes.
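
If the call still stalls at Jsoup.connect(...).get() on your device, one possible cause (an assumption; nothing in this thread confirms it) is that the request runs on the UI thread inside onCreate. Apps targeting Android 3.0 or later get a NetworkOnMainThreadException in that case. A minimal sketch of moving a blocking call onto a worker thread follows. The class name BackgroundFetch is hypothetical, and the sketch assumes a Java 8+ toolchain (on Android, CompletableFuture needs API level 24 or later; on older versions AsyncTask or a plain Thread would play the same role).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper (not in the original post): runs a blocking call off the calling thread. */
public final class BackgroundFetch {

    // Single daemon worker so a pending fetch never keeps the process alive.
    private static final ExecutorService WORKER = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "fetch-worker");
        thread.setDaemon(true);
        return thread;
    });

    private BackgroundFetch() {}

    /**
     * Runs the blocking task on the worker thread and returns a future that
     * completes with its result, or exceptionally if the task throws.
     */
    public static <T> CompletableFuture<T> fetch(Callable<T> blockingTask) {
        CompletableFuture<T> result = new CompletableFuture<>();
        WORKER.execute(() -> {
            try {
                result.complete(blockingTask.call());
            } catch (Exception e) {
                // Checked exceptions such as IOException end up here.
                result.completeExceptionally(e);
            }
        });
        return result;
    }
}
```

In the activity, the Jsoup call would go inside the task, for example BackgroundFetch.fetch(() -> Jsoup.connect("http://digg.com/enable/mobile").get().select("div").toString()), and the Toasts or WebView updates would be posted back with runOnUiThread(...) from a whenComplete callback, since views may only be touched from the UI thread.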