Using requests to scrape a paginated website

Time: 2015-09-19 07:10:37

Tags: python web-scraping python-requests

I'm trying to scrape several numbered pages (years) from Wikipedia:

for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)

x = BeautifulSoup(source.text, "html.parser")

x

However, when I inspect x, I see that I've only downloaded the 1999 page. How do I scrape all the pages I need, from 1991 to 2000?

And how do I put them into a dictionary with the year as the key and the text as the value?

2 answers:

Answer 0 (score: 1)

That's because your x is outside the for loop, so it only parses the last response. Change your code to this -

import requests
from bs4 import BeautifulSoup

res_dict = {}
for year in range(1991, 1994, 1):  # use range(1991, 2000) for all the years in the question
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)

    soup = BeautifulSoup(source.content, "html.parser")
    res_dict[year] = soup.text
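
After the loop finishes, res_dict holds one entry per year. The sketch below shows the same accumulation pattern with the network call stubbed out, so you can see the shape of the result without hitting Wikipedia; fetch_text is a hypothetical stand-in for requests.get plus BeautifulSoup:

```python
# Hypothetical stand-in for fetching and parsing a page.
def fetch_text(year):
    return f"Article text for {year}"

res_dict = {}
for year in range(1991, 1994):
    # The key line: store each iteration's result instead of overwriting one variable.
    res_dict[year] = fetch_text(year)

print(sorted(res_dict))  # [1991, 1992, 1993]
print(res_dict[1991])    # Article text for 1991
```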

Answer 1 (score: 0)

That's because the for loop runs its body repeatedly, reassigning the loop variables each time. Let's look at an example:

for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url) 

Now, the first time through the loop, url is https://en.wikipedia.org/wiki/1991. The second time, url is https://en.wikipedia.org/wiki/1992.

The last time, url is https://en.wikipedia.org/wiki/1999. So after the loop, source is the result of requests.get("https://en.wikipedia.org/wiki/1999").
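
One way to see all nine URLs the loop builds (a sketch, not part of the original answer) is to construct them up front in a list:

```python
# Build every URL the loop above would visit, without making any requests.
urls = ["https://en.wikipedia.org/wiki/" + str(year) for year in range(1991, 2000)]

print(urls[0])    # https://en.wikipedia.org/wiki/1991
print(urls[-1])   # https://en.wikipedia.org/wiki/1999
print(len(urls))  # 9 - range(1991, 2000) stops before 2000
```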

If that isn't clear, try the following code:

for i in range(1, 10):
    a = i
    print(a)   # prints 1 through 9, once per iteration

print(a)       # prints 9: only the value from the last iteration survives
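
To keep every value instead of only the last one, collect them into a container as the loop runs. A minimal sketch:

```python
values = []
for i in range(1, 10):
    values.append(i)  # store each value instead of overwriting a single variable

print(values)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```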

So x = BeautifulSoup(source.text, "html.parser") must be inside the for loop, like this:

for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)

    x = BeautifulSoup(source.text, "html.parser")
    # x is reassigned on every iteration, so store each parsed page
    # (e.g. in a dict keyed by year) if you need all of them afterwards