Displaying all search results when web scraping with Python

Date: 2014-09-24 13:49:05

Tags: python web-scraping html-parsing beautifulsoup

I am trying to scrape a list of URLs from the Legislative Observatory of the European Parliament. I do not enter any search keyword, so that I get the links to all documents (currently 13172). With the code below I can easily scrape the list of the first 10 results displayed on the site. However, I want to get all of the links, so that I do not have to somehow press the next-page button. Please let me know if you know of a way to achieve this.

import requests, bs4, re

# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'

# returns a list of links to the procedures
def links_to_procedures(url_main):
    # request the HTML of the Legislative Observatory's main search page
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')  # parse the response with Beautiful Soup
    # collect the href of every procedure title link
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')]
    return links

print(links_to_procedures(url_main))

1 Answer:

Answer 0 (score: 0)

You can follow the pagination by specifying the page GET parameter.

First, get the result count, then calculate the number of pages to process by dividing that count by the number of results per page. Then iterate over the pages one by one and collect the links:

import math
import re

from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content, 'html.parser')

# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search(r'(\d+)', num_results).group(1))
print("Results found: " + str(num_results))

results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page

# round up so the final, partially filled page is included
num_pages = math.ceil(num_results / results_per_page)

links = []
for page in range(1, num_pages + 1):
    print("Current page: " + str(page))

    url = base_url.format(page=page)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')
    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]

print(links)
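
Note that the hrefs collected this way may be relative paths rather than full URLs. Here is a minimal post-processing sketch, assuming the hrefs are relative to the site root, that turns them into absolute URLs with urllib.parse.urljoin:

from urllib.parse import urljoin

# hypothetical post-processing step: turn the collected hrefs into absolute URLs;
# assumes links holds paths relative to the site root
base = 'http://www.europarl.europa.eu'
absolute_links = [urljoin(base, href) for href in links if href]
print(absolute_links[:5])  # inspect the first few absolute URLs

urljoin leaves hrefs that are already absolute untouched, so this is safe to run even if the site mixes relative and absolute links.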