Scraping a website with a specific format using Python

Date: 2019-10-22 10:55:03

Tags: python web-scraping

I'm trying to scrape the U.S. News global university rankings with Python, but I keep struggling. I usually work with Python's "requests" and "BeautifulSoup".

The data is here:

https://www.usnews.com/education/best-global-universities/rankings

Right-clicking and choosing Inspect shows lots of links, and I don't even know which one to pick. I found an example online, but it only gives me empty data:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import math
from lxml.html import parse
from io import StringIO


url = 'https://www.usnews.com/education/best-global-universities/rankings'
urltmplt = 'https://www.usnews.com/education/best-global-universities/rankings?page=2'

css = '#resultsMain :nth-child(1)'
npage = 20

urlst = [url] + [urltmplt + str(r) for r in range(2,npage+1)]

def scrapevec(url, css):
    doc = parse(StringIO(url)).getroot()
    return([link.text_content() for link in doc.cssselect(css)])

usng = []
for u in urlst:
    print(u)
    ts = [re.sub("\n *"," ", t) for t in scrapevec(u,css) if t != ""]

This doesn't work: scrapevec(u, css) returns an empty list every time, so t is always empty.

Any help is greatly appreciated.

1 Answer:

Answer 0 (score: 1)

The MWE you posted doesn't work at all: re is used without being imported, and parse(StringIO(url)) parses the URL string itself rather than fetching the page behind it, so the selector has nothing to match. I strongly suggest you look for a basic scraping tutorial (in Python, Java, etc.): there are plenty, and they're generally a good place to start.

Below you'll find a piece of code that prints the names of the universities listed on page 1; you'll be able to extend it to all 150 pages with a for loop.

import requests
from bs4 import BeautifulSoup

newheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings'

page1 = requests.get(baseurl, headers = newheaders) # change headers or get blocked 
soup = BeautifulSoup(page1.text, 'lxml')
res_tab = soup.find('div', {'id' : 'resultsMain'}) # find the results' table

for a,univ in enumerate(res_tab.findAll('a', href = True)): # parse universities' names
    if a < 10: # there are 10 listed universities per page
        print(univ.text)

EDIT: the example now runs, but as you say in your question, it only returns empty lists. Below is a modified version of the code that returns the list of all universities (pages 1-150):

import requests 
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers = newheaders) # change headers or get blocked 
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id' : 'resultsMain'}) # find the results' table
    res = []
    for a,univ in enumerate(res_tab.findAll('a', href = True)): # parse universities' names
        if a < 10: # there are 10 listed universities per page
            res.append(univ.text)
    return res

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='

ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)] # this is a list of lists

univs = [item for sublist in ll for item in sublist] # unfold the list of lists
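One practical note: the comprehension above fires 150 requests back-to-back, which real sites often rate-limit. A hedged sketch of the same loop with a pause between requests is below; the stub parse_univ stands in for the real function above so the sketch runs offline, and the delay length is an arbitrary choice:

```python
import time

def parse_univ(url):
    # stand-in stub so this sketch runs offline; use the real parse_univ above
    return ['Example University']

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='

ll = []
for p in range(1, 4):  # use range(1, 151) for the real run
    ll.append(parse_univ(baseurl + str(p)))
    time.sleep(0.1)  # short pause between requests to stay polite
```

A plain for loop (rather than a comprehension) makes it easy to add the sleep, retries, or progress printing per page.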

RE-EDIT, as suggested by QHarr (thanks!): a shorter, more "pythonic" solution with the same output:

import requests 
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers = newheaders) # change headers or get blocked 
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id' : 'resultsMain'}) # find the results' table
    return [univ.text for univ in res_tab.select('[href]', limit=10)]

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='

ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)] # this is a list of lists

univs = [item for sublist in ll for item in sublist]
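The flattening step can also be written with the standard library's itertools.chain.from_iterable, which some find more readable than the nested comprehension; the sample ll below is made-up data standing in for the scraped pages:

```python
from itertools import chain

ll = [['Harvard University', 'MIT'], ['Stanford University']]  # stand-in for the scraped pages
univs = list(chain.from_iterable(ll))
print(univs)  # ['Harvard University', 'MIT', 'Stanford University']
```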