I am using Scrapy to scrape the href links from the table on this page: https://researchgrant.gov.sg/eservices/advanced-search/?keyword=&source=sharepoint&type=project&status=open&page=2&_pp_projectstatus=&_pp_hiname=ab&_pp_piname=pua&_pp_source=sharepoint&_pp_details=#project. I can select the div MVCGridTableHolder_advancesearchawardedprojectsp_, but not its children (a div with class "row" and a div with a style attribute). My attempt is shown below. Is this because of the partial view?
HTML code:
<div id="MVCGridContainer_advancesearchawardedprojectsp_" data-key="" class="MVCGridContainer">
<!--Partial View!-->
<div class="row"></div>
<div style="overflow-x:auto;">
<table name="MVCGridTable_advancesearchawardedprojectsp" class="table table-striped table-bordered iris-grid">
<thead></thead>
<tbody>
<tr>
<td>
<a class="grid-link" target="_top" href="https://researchgrant.gov.sg/pages/Awarded-Project-Detail.aspx?AXID=MOH-000080&CompanyCode=moh">INVESTIGATING DIVERSIFIED BIFUNCTIONAL MACROCYCLES BY PHAGE DISPLAY AS A NOVEL TECHNOLOGY PLATFORM</a>
</td>
</div></div>
Scrapy shell attempt:
In [12]: quote = response.xpath('//div[@id="MVCGridTableHolder_advancesearchawardedprojectsp_"]')
In [13]: quote
Out[13]: [<Selector xpath='//div[@id="MVCGridTableHolder_advancesearchawardedprojectsp_"]' data='<div id="MVCGridTableHolder_advancese...'>]
In [14]: quote = response.xpath('//div[@id="MVCGridTableHolder_advancesearchawardedprojectsp_"]/div[@class="row"]')
In [15]: quote
Out[15]: []
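Answer 0 (score: 0)
If you open the browser developer tools while this page loads, you will see that a separate XHR request is sent to load the partial view content. That is why the children of the container div are missing from the HTML Scrapy receives. You can simulate that request in your code.
An example using requests: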
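import requests

with requests.Session() as session:
    # Disables TLS certificate verification, as in the original answer.
    session.verify = False
    # Mark the request as an XHR, the way the page's own script does.
    session.headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }

    # The search filters go in the query string; the grid name goes in the POST body.
    response = session.post("https://researchgrant.gov.sg/eservices/mvcgrid", params={
        'keyword': '',
        'source': 'sharepoint',
        'type': 'project',
        'status': 'open',
        'page': '2',
        '_pp_projectstatus': '',
        '_pp_hiname': 'ab',
        '_pp_piname': 'pua',
        '_pp_source': 'sharepoint',
        '_pp_details': ''},
        data={
            'name': 'advancesearchawardedprojectsp'
        })

    # The response body is the HTML fragment of the partial view.
    print(response.text)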
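Since the goal in the question is the href links rather than the raw HTML, the returned fragment still has to be parsed. Below is a minimal sketch using only the standard library, assuming the fragment contains the same `<a class="grid-link">` anchors shown in the question. `GridLinkParser` and `extract_grid_links` are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class GridLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list contains "grid-link"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or ''".
        classes = (attr_map.get("class") or "").split()
        if "grid-link" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_grid_links(html_text):
    """Return the grid-link hrefs found in an HTML fragment (hypothetical helper)."""
    parser = GridLinkParser()
    parser.feed(html_text)
    return parser.links
```

With the requests example above, this would be called as `extract_grid_links(response.text)`.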
In Scrapy, you can use FormRequest: