Using Scrapy to scrape data behind "load more"?

Time: 2017-03-29 08:36:26

Tags: python post scrapy

I need to scrape some news from this website: https://www.huxiu.com/channel/103.html. Here 103 is the news category id.

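However, unless the AJAX "load more" call is triggered, I can only get the first page.
Strangely, the request URL is the same for every news category.

The page information is passed through the Referer header, and the page is sent as form data.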

Here is my code snippet:

    self.page += 1
    url = "https://www.huxiu.com/channel/ajaxGetMore"
    method = "POST"

    headers = {
        "Host": "www.huxiu.com",
        "Origin": "https://www.huxiu.com",
        "Referer": "https://www.huxiu.com/channel/106.html",
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/"
            "537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Sa"
            "fari/537.36"
        ),
        "X-Requested-With": "XMLHttpRequest",
    }

    formdata = {
        "huxiu_hash_code": "9aee58d3507ecafed74df13e156ab01b",
        "page": str(self.page),
        "catId": "106"
    }

    yield FormRequest(
        url=url,
        method=method,
        headers=headers,
        formdata=formdata,
        callback=self.parse
    )

It fails to load more of the news feed. How do I send a POST request to scrape more news?

1 Answer:

Answer 0: (score: 1)

In this case, GET and POST requests appear to be interchangeable. This is a very common AJAX pagination technique:

If you open https://www.huxiu.com/channel/ajaxGetMore?catId=103&page=3 in your browser, you will see JSON data that contains all of the pagination data along with metadata such as total_page. This information is easy to scrape, and it lets you scrape every page concurrently, because you know the number of pages from the first request.
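As a rough illustration of the idea, the sketch below builds the GET URL for a page and derives the URLs of all remaining pages from the body of the first one. It uses only the standard library. The function names, the sample body, and the location of total_page inside the JSON (under a top-level "data" key) are assumptions made for this example, not something confirmed by the site.

```python
import json
from urllib.parse import urlencode

AJAX_URL = "https://www.huxiu.com/channel/ajaxGetMore"


def page_url(cat_id, page):
    """Build the GET URL for one page of one news category."""
    return AJAX_URL + "?" + urlencode({"catId": cat_id, "page": page})


def remaining_page_urls(first_page_body, cat_id):
    """Return the URLs of pages 2..total_page, given the JSON body of page 1."""
    payload = json.loads(first_page_body)
    # Assumption: the page count is stored at payload["data"]["total_page"].
    total_page = int(payload["data"]["total_page"])
    return [page_url(cat_id, page) for page in range(2, total_page + 1)]


# Hypothetical, trimmed-down response body used only for illustration.
sample_body = '{"data": {"total_page": 4, "data": "<div>...</div>"}}'
urls = remaining_page_urls(sample_body, 103)
```

Because the category id and the page number both travel in the query string here, no Referer header or form data is needed to tell the categories apart.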

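For example, a Python 3 spider can handle this kind of pagination by reading total_page from the first response and then requesting each remaining page.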