Scrapy cannot access child div classes

Time: 2019-12-09 02:59:30

Tags: python html xpath web-scraping scrapy

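I am using Scrapy to scrape the href links from the table on this page: https://researchgrant.gov.sg/eservices/advanced-search/?keyword=&source=sharepoint&type=project&status=open&page=2&_pp_projectstatus=&_pp_hiname=ab&_pp_piname=pua&_pp_source=sharepoint&_pp_details=#project. I can select the div MVCGridTableHolder_advancesearchawardedprojectsp_, but I cannot select its children: the div with class "row" and the div with a style attribute. My attempts are shown below. Is this because of the partial view?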

HTML code:

<div id="MVCGridContainer_advancesearchawardedprojectsp_" data-key="" class="MVCGridContainer">
<!--Partial View!-->
<div class="row"></div>
<div style="overflow-x:auto;">
<table name="MVCGridTable_advancesearchawardedprojectsp" class="table table-striped table-bordered iris-grid">
<thead></thead>
<tbody>
      <tr>
         <td>
         <a class="grid-link" target="_top" href="https://researchgrant.gov.sg/pages/Awarded-Project-Detail.aspx?AXID=MOH-000080&amp;CompanyCode=moh">INVESTIGATING DIVERSIFIED BIFUNCTIONAL MACROCYCLES BY PHAGE DISPLAY AS A NOVEL TECHNOLOGY PLATFORM</a>
         </td>
</div></div>

Scrapy shell attempts:

In [12]: quote = response.xpath('//div[@id="MVCGridTableHolder_advancesearchawardedprojectsp_"]')

In [13]: quote
Out[13]: [<Selector xpath='//div[@id="MVCGridTableHolder_advancesearchawardedprojectsp_"]' data='<div id="MVCGridTableHolder_advancese...'>]

In [14]: quote = response.xpath('//div[@id="MVCGridTableHolder_advancesearchawardedprojectsp_"]/div[@class="row"]')

In [15]: quote
Out[15]: []

1 answer:

Answer 0 (score: 0)

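If you open the browser's developer tools while this page loads, you will see that a separate XHR request is sent to load the partial view content. That is why the container's children are missing from the HTML that Scrapy downloads. You can simulate that request in your code.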

Example using requests:

import requests


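with requests.Session() as session:
    # Skip TLS certificate verification (insecure; avoid outside of testing).
    session.verify = False

    # Identify the request as an XMLHttpRequest, keeping the default headers.
    session.headers.update({
        'X-Requested-With': 'XMLHttpRequest'
    })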
    response = session.post("https://researchgrant.gov.sg/eservices/mvcgrid", params={
        'keyword': '',
        'source': 'sharepoint',
        'type': 'project',
        'status': 'open',
        'page': '2',
        '_pp_projectstatus': '',
        '_pp_hiname': 'ab',
        '_pp_piname': 'pua',
        '_pp_source': 'sharepoint',
        '_pp_details': ''},
        data={
            'name': 'advancesearchawardedprojectsp'
        })

    print(response.text)

In Scrapy, you can use FormRequest to send the same POST request.