Scrapy output is empty

Time: 2019-08-23 02:43:44

Tags: web-scraping scrapy

I am trying to extract paper titles from IEEE Xplore with Scrapy. I opened the page in the Scrapy shell with:

scrapy shell 'https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5962385'

For the first paper title, I used the browser's "Copy XPath" option to get its XPath. Then I tried:

response.xpath('//*[@id="publicationIssueMainContent"]/div[2]/div/div[2]/div/xpl-issue-results-list/div[2]/div[4]/div/xpl-issue-results-items/div[2]/div[2]/h2/a').getall()

I also tried response.css(div.List-results-items)

However, neither approach returned any output.

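1 answer:

Answer 0: (score: 0)

The data is loaded dynamically through an XHR POST request, so it is not in the HTML that Scrapy downloads. You can simply issue that XHR yourself with the requests library and get the results for every page as JSON: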

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5962385',
    'Accept': 'application/json, text/plain, */*',
    'cache-http-response': 'false',
    'Content-Type': 'application/json',
}

params = (
    ('punumber', '5962385'),
)

# JSON body of the XHR; 'isnumber' matches the issue id in the URL below
data = {"punumber": "5962385", "sortType": "vol-only-seq", "isnumber": 8809853}
url = 'https://ieeexplore.ieee.org/rest/search/pub/5962385/issue/8809853/toc'
results = {}  # page number -> decoded JSON response

with requests.Session() as s:
    # The first request returns page 1 and reports the total number of pages
    r = s.post(url, headers=headers, params=params, json=data).json()
    results[1] = r
    num_pages = r['totalPages']

    # Fetch the remaining pages by setting 'pageNumber' in the request body
    for page in range(2, num_pages + 1):
        data['pageNumber'] = page
        r = s.post(url, headers=headers, params=params, json=data).json()
        results[page] = r
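
Once results holds one decoded JSON page per key, the titles still have to be pulled out of it. Here is a minimal sketch, assuming each page lists its articles under a 'records' key and each article has an 'articleTitle' field. Those two key names are assumptions that the answer above does not confirm, so inspect a real response first (for example, print(results[1].keys())). The helper name extract_titles is hypothetical.

```python
def extract_titles(results, records_key="records", title_key="articleTitle"):
    """Collect paper titles from a {page_number: decoded_json} mapping.

    records_key and title_key are assumed field names; adjust them to
    match the JSON the endpoint actually returns.
    """
    titles = []
    for page in sorted(results):
        # Fall back to an empty list when a page has no records entry
        for record in results[page].get(records_key) or []:
            title = record.get(title_key)
            if title:
                titles.append(title)
    return titles


if __name__ == "__main__":
    # Made-up sample shaped like the assumed response, for illustration only
    sample = {
        2: {"records": [{"articleTitle": "Paper C"}]},
        1: {"records": [{"articleTitle": "Paper A"}, {"articleTitle": "Paper B"}]},
    }
    print(extract_titles(sample))
```

With the results dict built above, extract_titles(results) would return the titles in page order. If the real field names differ, pass them through the records_key and title_key arguments.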